
Statistics 202: Data Mining, Week 9

Based in part on slides from the textbook and slides of Susan Holmes

© Jonathan Taylor

December 2, 2012

Part I: Hierarchical clustering


Description

Produces a set of nested clusters organized as a hierarchical tree.

Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.


A clustering and its dendrogram.


Strengths

Do not have to assume any particular number of clusters: each horizontal cut of the tree yields a clustering.

The tree may correspond to a meaningful taxonomy (e.g., animal kingdom, phylogeny reconstruction, ...).

Need only a similarity or distance matrix for implementation (a minimal R sketch follows).
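The sketch below is not from the slides; it assumes the built-in iris data, Euclidean distances, and an arbitrary 3-cluster cut. Only the distance object d is actually required by hclust().

d  <- dist(iris[, 1:4])                 # pairwise Euclidean distances
hc <- hclust(d, method = "single")      # single-linkage agglomerative clustering
plot(hc, labels = FALSE)                # dendrogram; every horizontal cut is a clustering
table(cutree(hc, k = 3), iris$Species)  # compare one 3-cluster cut to the species labels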


Agglomerative

Start with the points as individual clusters.

At each step, merge the closest pair of clusters until only one cluster (or some fixed number k of clusters) remains.


Divisive

Start with one, all-inclusive cluster.

At each step, split a cluster until each cluster contains a single point (or there are k clusters).


Agglomerative Clustering Algorithm

1. Compute the proximity matrix.
2. Let each data point be a cluster.
3. While there is more than one cluster:
   1. Merge the two closest clusters.
   2. Update the proximity matrix.

The major difference between the various methods is the computation of the proximity of two clusters. A minimal R sketch of this loop follows.
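The sketch below is not the lecture's code. It uses single linkage as the cluster proximity and, for simplicity, recomputes cluster-to-cluster distances from the pointwise distance matrix rather than updating the proximity matrix in place; the iris columns and k = 3 are arbitrary choices.

naive_agglomerative <- function(x, k = 1) {
  d <- as.matrix(dist(x))                 # step 1: proximity (distance) matrix
  clusters <- as.list(seq_len(nrow(d)))   # step 2: each point is its own cluster
  while (length(clusters) > k) {          # step 3: merge until k clusters remain
    m <- length(clusters); best <- c(1, 2); best_d <- Inf
    for (i in 1:(m - 1)) for (j in (i + 1):m) {
      dij <- min(d[clusters[[i]], clusters[[j]]])   # single-linkage proximity
      if (dij < best_d) { best_d <- dij; best <- c(i, j) }
    }
    clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
    clusters[[best[2]]] <- NULL           # merge the two closest clusters
  }
  clusters
}

str(naive_agglomerative(iris[, 1:4], k = 3))   # three clusters of row indices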


Starting point for agglomerative clustering.


Intermediate point, with 5 clusters.


We will merge C2 and C5.


How do we update the proximity matrix?

We need a notion of similarity between clusters. Candidate definitions of inter-cluster similarity include MIN (single link), MAX (complete link), the group average, the distance between centroids, and other methods driven by an objective function (Ward's method uses squared error).


Single linkage uses the minimum distance.


Complete linkage uses the maximum distance.


Group average linkage uses the average distance between groups.


Centroid linkage uses the distance between the centroids of the clusters (this presumes one can compute centroids...).

Single link (MIN)

The similarity of two clusters is based on the two most similar (closest) points in the different clusters; it is determined by one pair of points, i.e., by one link in the proximity graph.

Proximity matrix and dendrogram of single linkage.


Distance matrix for the nested clusterings (distances between points 1-6, upper triangle):

        2     3     4     5     6
1    0.24  0.22  0.37  0.34  0.23
2          0.15  0.20  0.14  0.25
3                0.15  0.28  0.11
4                      0.29  0.22
5                            0.39
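As an illustrative sketch (not part of the slides), the same matrix can be entered in R and handed directly to hclust() to reproduce the single-linkage dendrogram:

D <- matrix(0, 6, 6)
D[lower.tri(D)] <- c(0.24, 0.22, 0.37, 0.34, 0.23,   # distances from point 1
                     0.15, 0.20, 0.14, 0.25,          # from point 2
                     0.15, 0.28, 0.11,                 # from point 3
                     0.29, 0.22,                       # from point 4
                     0.39)                             # from point 5
D  <- D + t(D)                              # make the matrix symmetric
hc <- hclust(as.dist(D), method = "single")
plot(hc)                                    # dendrogram of single linkage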


Nested cluster representation and dendrogram of single linkage.


Single linkage

Can handle irregularly shaped regions fairly naturally.

Sensitive to noise and outliers in the form of “chaining”.


The Iris data (single linkage)


Complete link (MAX)

The similarity of two clusters is based on the two least similar (most distant) points in the different clusters; it is determined by all pairs of points in the two clusters.

Proximity matrix and dendrogram of complete linkage.


Nested cluster and dendrogram of complete linkage.


Complete linkage

Less sensitive to noise and outliers than single linkage.

Regions are generally compact, but may violate "closeness": a point may be much closer to some points in a neighbouring cluster than to points in its own cluster.

This manifests itself as breaking large clusters.

Clusters are biased to be globular.


The Iris data (complete linkage)


Nested cluster and dendrogram of group average linkage.


Average linkage

Given two elements of the partition $C_r$, $C_s$, we might consider
$$ d_{GA}(C_r, C_s) = \frac{1}{|C_r|\,|C_s|} \sum_{x \in C_r,\, y \in C_s} d(x, y). $$

A compromise between single and complete linkage.

Shares the globular-cluster bias of complete linkage, but is less sensitive to noise than single linkage.
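A minimal R sketch of the group-average formula; the two index sets below are arbitrary, hypothetical clusters chosen only for illustration.

group_average_distance <- function(x, idx_r, idx_s) {
  d <- as.matrix(dist(x))   # all pairwise distances d(x, y)
  mean(d[idx_r, idx_s])     # average over pairs with x in Cr and y in Cs
}

group_average_distance(iris[, 1:4], 1:50, 101:150)   # two hypothetical clusters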


The Iris data (average linkage)


Ward’s linkage

Similarity of two clusters is based on the increase in squared error when the two clusters are merged.

Similar to average linkage if the dissimilarity between points is the squared distance. Hence, it shares many properties of average linkage.

A hierarchical analogue of K-means.

Sometimes used to initialize K-means.
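A minimal R sketch of that last idea (not from the slides), assuming Ward's linkage as implemented by hclust(method = "ward.D2") and using the means of a 3-cluster cut of the scaled iris data as starting centers for kmeans():

x  <- scale(iris[, 1:4])
hc <- hclust(dist(x), method = "ward.D2")    # Ward's linkage on Euclidean distances
labels0  <- cutree(hc, k = 3)                # partition taken from the dendrogram
centers0 <- apply(x, 2, function(v) tapply(v, labels0, mean))  # 3 x 4 matrix of cluster means
km <- kmeans(x, centers = centers0)          # K-means initialized at those centers
table(labels0, km$cluster)                   # how much K-means moved the partition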


The Iris data (Ward’s linkage)


NCI data (complete linkage)


NCI data (single linkage)


NCI data (average linkage)


NCI data (Ward’s linkage)


Computational issues

$O(n^2)$ space, since it uses the proximity matrix.

$O(n^3)$ time in many cases: there are $n$ steps, and at each step a matrix of size $n^2$ must be updated and/or searched.


Statistical issues

Once a decision is made to combine two clusters, it cannot be undone.

No objective function is directly minimized.

Different schemes have problems with one or more of the following:

Sensitivity to noise and outliers.
Difficulty handling different sized clusters and convex shapes.
Breaking large clusters.

Part II: Model-based clustering


General approach

Choose a type of mixture model (e.g., multivariate normal) and a maximum number of clusters, K.

Use a specialized hierarchical clustering technique.

Use some criterion to determine the optimal model and number of clusters.


Choosing a mixture model

General form of the mixture model

$$ f(x) = \sum_{j=1}^{k} \pi_j \, f(x; \theta_j) $$

For the multivariate normal, $\theta_j = (\mu_j, \Sigma_j)$.

The EM algorithm we discussed before assumed the $\Sigma_j$ are all different in the different classes.

Other possibilities: $\Sigma_j = \Sigma$, $\Sigma_j = \lambda_j \cdot I$, etc.


Generally, we can write

$$ \Sigma_j = c_j D_j A_j D_j^T, $$
with $c_j\,\mathrm{diag}(A_j)$ the eigenvalues of $\Sigma_j$, normalized so that $\max(A_j) = 1$.

The parameter $c_j$ controls the size, $A_j$ the shape, and $D_j$ the orientation.
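As an illustrative sketch, these size/shape/orientation constraints are what the mclust R package (an assumption here, not something the slides use) encodes in model names such as "EII" (spherical, equal size), "VVI" (diagonal, varying size and shape) and "VVV" (fully varying); the BIC used to compare them is discussed later in the lecture.

library(mclust)
bic <- mclustBIC(iris[, 1:4], G = 1:9,
                 modelNames = c("EII", "VVI", "EEE", "VVV"))
summary(bic)   # best (covariance model, number of clusters) pairs by BIC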


Model-based agglomerative clustering

Ward’s criterion

A hierarchical clustering algorithm that merges $k$ clusters $\{C^k_1, \ldots, C^k_k\}$ into $k-1$ clusters based on
$$ WSS = \sum_{j=1}^{k-1} WSS(C^{k-1}_j), $$
where $WSS$ is the within-cluster sum of squared distances.

The procedure merges the two clusters $C^k_i, C^k_l$ that produce the smallest increase in $WSS$.


NCI data (Ward’s linkage)


Model-based agglomerative clustering

If $\Sigma_j = \sigma^2 \cdot I$, then Ward's criterion is equivalent to merging based on the criterion
$$ -2 \log L(\theta, \ell), $$
where
$$ L(\theta, \ell) = \prod_{i=1}^{n} f(x_i; \theta_{\ell_i}) $$
is called the classification likelihood.

This idea can be used to make a hierarchical clustering algorithm for other types of multivariate normal models, e.g., equal shape, same size, etc.


Model selection & BIC

Bayesian Information Criterion

After a merge, the clusters are taken as initial starting points for the EM algorithm.

This results in several mixture models: one for each number of clusters and each type of mixture model considered.

How do we choose?

This raises the topic of model selection.


Suppose we have several possible models $\mathcal{M} = \{M_1, \ldots, M_T\}$ for a data set, which we assume is given by a data matrix $X_{n \times p}$.

These models have parameters $\Theta = \{\theta_1, \ldots, \theta_T\}$. Further, suppose that each one has a likelihood $L_j$, and that $\hat\Theta = \{\hat\theta_1, \ldots, \hat\theta_T\}$ are the maximum likelihood estimators.

We can compare $-2 \log L_j(\hat\theta_j)$, but this ignores how much "fitting" each model does.

A common approach is to add a penalty that makes different models comparable.


The BIC of a model is usually
$$ BIC(M_j) = -2 \log L_j(\hat\theta_j) + \log n \cdot \#\{\text{parameters in } M_j\}. $$

The BIC can be thought of as approximating
$$ P(M_j \text{ is correct} \mid X_{n \times p}) $$
under an appropriate Bayesian model for $X$.


Typically, statisticians will try to prove that choosing the model with the best BIC yields the "correct" model.

Some theoretical justification is needed for this, and it breaks down for mixture models. Nevertheless, BIC is still used.

Another common criterion is the AIC (Akaike Information Criterion):
$$ AIC(M_j) = -2 \log L_j(\hat\theta_j) + 2 \cdot \#\{\text{parameters in } M_j\}. $$
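A minimal R sketch of the two formulas, checked against the built-in BIC() and AIC() on an ordinary linear model; the model itself is arbitrary and only the bookkeeping matters.

fit    <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
loglik <- as.numeric(logLik(fit))
k      <- attr(logLik(fit), "df")    # number of estimated parameters (incl. sigma)
n      <- nrow(iris)
c(by_hand_BIC = -2 * loglik + log(n) * k, builtin_BIC = BIC(fit))
c(by_hand_AIC = -2 * loglik + 2 * k,      builtin_AIC = AIC(fit))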


Model-based clustering

Summary

1. Choose a type of mixture model (e.g., multivariate normal) and a maximum number of clusters, K.
2. Use a specialized hierarchical clustering technique: model-based hierarchical agglomeration.
3. Use the clusters from the previous step to initialize EM for the mixture model.
4. Use BIC to compare different mixture models and models with different numbers of clusters.

A minimal R sketch of this pipeline follows.
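The sketch below assumes the mclust package (not named in the slides); its Mclust() function by default initializes EM with model-based hierarchical agglomeration and selects the covariance model and number of clusters by BIC, which is essentially the recipe above.

library(mclust)
fit <- Mclust(iris[, 1:4], G = 1:9)       # fit and compare several mixture models
summary(fit)                              # chosen covariance model, G, and BIC
table(fit$classification, iris$Species)   # clusters vs. the known species labels
plot(fit, what = "BIC")                   # BIC traces over models and cluster counts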


The Iris data "best" model: equal shape, 2 components.


The Iris data


Part III: Outliers


Concepts

What is an outlier? The set of data points that are considerably different from the remainder of the data...

When do they appear in data mining tasks?

Given a data matrix $X$, find all cases $x_i \in X$ with anomaly/outlier scores greater than some threshold $t$, or the top $n$ outlier scores.
Given a data matrix $X$ containing mostly normal (but unlabeled) data points, and a test case $x_{\text{new}}$, compute an anomaly/outlier score for $x_{\text{new}}$ with respect to $X$.

Applications:

Credit card fraud detection.
Network intrusion detection.
Misspecification of a model.



Issues

How many outliers are there in the data?

The method is unsupervised, similar to clustering, or to finding clusters with only one point in them.

Usual assumption: there are considerably more "normal" observations than "abnormal" observations (outliers/anomalies) in the data.


General steps

Build a profile of the "normal" behavior. The profile generally consists of summary statistics of this "normal" population.

Use these summary statistics to detect anomalies, i.e., points whose characteristics are very far from the normal profile.

General types of schemes involve a statistical model of "normal", and "far" is measured in terms of likelihood.

Other schemes based on distances can be quasi-motivated by such statistical techniques...


Statistical approach

Assume a parametric model describing the distribution of the data (e.g., a normal distribution).

Apply a statistical test that depends on:

The data distribution (e.g., normal).
The parameters of the distribution (e.g., mean, variance).
The number of expected outliers (confidence limit, $\alpha$, or Type I error).


Grubbs’ Test

Suppose we have a sample of $n$ numbers $Z = \{Z_1, \ldots, Z_n\}$, i.e., an $n \times 1$ data matrix.

Assuming the data come from a normal distribution, Grubbs' test uses the distribution of
$$ \frac{\max_{1 \le i \le n} Z_i - \bar{Z}}{SD(Z)} $$
to search for outlying large values.


Lower-tail variant:
$$ \frac{\min_{1 \le i \le n} Z_i - \bar{Z}}{SD(Z)} $$

Two-sided variant:
$$ \frac{\max_{1 \le i \le n} |Z_i - \bar{Z}|}{SD(Z)} $$


Having chosen a test statistic, we must determine a threshold that defines our decision rule.

Often this is set via a hypothesis test to control the Type I error.

For a large positive outlier, the threshold is based on choosing some acceptable Type I error $\alpha$ and finding $c_\alpha$ so that
$$ P_0\!\left( \frac{\max_{1 \le i \le n} |Z_i - \bar{Z}|}{SD(Z)} \ge c_\alpha \right) \approx \alpha. $$

Above, $P_0$ denotes the distribution of $Z$ under the assumption that there are no outliers.

If the $Z_i$ are IID $N(\mu, \sigma^2)$, it is generally possible to compute a decent approximation of this probability using Bonferroni.


The two-sided critical level has the form
$$ c_\alpha = \frac{n-1}{\sqrt{n}} \sqrt{\frac{t^2_{\alpha/(2n),\,n-2}}{n - 2 + t^2_{\alpha/(2n),\,n-2}}}, $$
where
$$ P(T_k \ge t_{\gamma,k}) = \gamma $$
defines the upper-tail quantile of the $t$ distribution with $k$ degrees of freedom.

In R, you can use the functions pnorm, qnorm, pt, and qt for these quantities.
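A minimal R sketch combining the two-sided statistic and the critical value above; the data are simulated, and qt() with lower.tail = FALSE supplies the upper-tail t quantile.

grubbs_two_sided <- function(z, alpha = 0.05) {
  n    <- length(z)
  G    <- max(abs(z - mean(z))) / sd(z)                    # two-sided test statistic
  tq   <- qt(alpha / (2 * n), df = n - 2, lower.tail = FALSE)
  crit <- (n - 1) / sqrt(n) * sqrt(tq^2 / (n - 2 + tq^2))  # critical value c_alpha
  list(statistic = G, critical = crit, outlier_detected = G > crit)
}

set.seed(202)
z <- c(rnorm(50), 6)      # 50 standard normal values plus one large outlier
grubbs_two_sided(z)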


Model based: linear regression with outliers

First build a model; points which don't fit the model well are identified as outliers.

For the example in the figure, a least-squares regression model would be appropriate.

Figure: residuals from the model can be fed into Grubbs' test or a Bonferroni variant.
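A minimal sketch of that recipe on simulated (hypothetical) regression data, reusing the grubbs_two_sided() helper defined in the previous sketch.

set.seed(1)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.3)
y[100] <- y[100] + 5                 # plant one gross outlier
fit <- lm(y ~ x)
grubbs_two_sided(residuals(fit))     # screen the residuals as above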


Multivariate data

If the non-outlying data are assumed to be multivariate Gaussian, what is the analogue of Grubbs' statistic
$$ \frac{\max_{1 \le i \le n} |Z_i - \bar{Z}|}{SD(Z)} \,? $$

Answer: use the Mahalanobis distance,
$$ \max_{1 \le i \le n}\, (Z_i - \bar{Z})^T \hat{\Sigma}^{-1} (Z_i - \bar{Z}). $$

Above, each individual statistic has what looks like a Hotelling's $T^2$ distribution.
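A minimal R sketch using the base function mahalanobis(), which computes the quadratic form above with the sample mean and covariance plugged in; the iris data are just an example.

x      <- as.matrix(iris[, 1:4])
center <- colMeans(x)
Sigma  <- cov(x)
d2     <- mahalanobis(x, center, Sigma)   # (Z_i - Zbar)^T Sigma^{-1} (Z_i - Zbar)
head(order(d2, decreasing = TRUE))        # indices of the most extreme cases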


Likelihood approach

Assume the data are a mixture
$$ F = (1 - \lambda) M + \lambda A. $$

Above, $M$ is the distribution of "most of the data."

The distribution $A$ is an "outlier" distribution; it could be uniform on a bounding box for the data.

This is a mixture model. If $M$ is parametric, then the EM algorithm fits naturally here.

Any points assigned to $A$ are "outliers."


Do we estimate $\lambda$ or fix it?

The book starts by describing an algorithm that tries to maximize the equivalent classification likelihood
$$ L(\theta_M, \theta_A; \ell) = (1 - \lambda)^{\#\ell_M} \prod_{i \in \ell_M} f_M(x_i; \theta_M) \;\times\; \lambda^{\#\ell_A} \prod_{i \in \ell_A} f_A(x_i; \theta_A). $$


Likelihood approach: Algorithm

The algorithm tries to maximize this by forming iterative estimates $(M_t, A_t)$ of the "normal" and "outlying" data points.

1. At each stage, try to move individual points from $M_t$ to $A_t$.
2. Find $(\hat\theta_M, \hat\theta_A)$ based on the new partition (if necessary).
3. If the increase in likelihood is large enough, call these the new sets $(M_{t+1}, A_{t+1})$.
4. Repeat until no further changes.


Nearest neighbour approach

Many ways to define outliers.

Example: data points for which there are fewer than $k$ neighboring points within a distance $\varepsilon$.

Example: the $n$ points whose distance to their $k$-th nearest neighbour is largest.

Example: the $n$ points whose average distance to their first $k$ nearest neighbours is largest.

Each of these methods depends on the choice of some parameters: $k$, $n$, $\varepsilon$. It is difficult to choose these in a systematic way.


Density approach

For each point $x_i$, compute a density estimate $f_{x_i, k}$ using its $k$ nearest neighbours.

The density estimate used is
$$ f_{x_i, k} = \left( \frac{\sum_{y \in N(x_i, k)} d(x_i, y)}{\# N(x_i, k)} \right)^{-1}. $$

Define
$$ LOF(x_i) = \frac{f_{x_i, k}}{\left( \sum_{y \in N(x_i, k)} f_{y, k} \right) / \# N(x_i, k)}. $$
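A minimal R sketch of these two formulas with a brute-force nearest-neighbour search. Note that, as written, isolated points get ratios well below 1; the conventional LOF is essentially the reciprocal, where large values flag outliers.

lof_ratio <- function(x, k = 5) {
  d <- as.matrix(dist(x)); diag(d) <- Inf
  n <- nrow(d)
  nn   <- t(apply(d, 1, function(row) order(row)[1:k]))      # k nearest neighbours
  dens <- sapply(1:n, function(i) 1 / mean(d[i, nn[i, ]]))   # f_{x_i, k}
  sapply(1:n, function(i) dens[i] / mean(dens[nn[i, ]]))     # LOF(x_i) as defined above
}

x <- rbind(as.matrix(iris[, 1:4]), c(10, 10, 10, 10))  # append an artificial outlier
head(sort(lof_ratio(x, k = 5)))                        # the outlier sits at the bottom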


Figure: nearest neighbour vs. density-based detection. In the nearest neighbour approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers.


Detection rate

Set $P(O)$ to be the proportion of outliers or anomalies.

Set $P(D \mid O)$ to be the probability of declaring an outlier if it truly is an outlier. This is the detection rate.

Set $P(D \mid O^c)$ to be the probability of declaring an outlier if it is truly not an outlier.


Bayesian detection rate

The Bayesian detection rate is
$$ P(O \mid D) = \frac{P(D \mid O)\, P(O)}{P(D \mid O)\, P(O) + P(D \mid O^c)\, P(O^c)}. $$

The false alarm rate (or false discovery rate) is
$$ P(O^c \mid D) = \frac{P(D \mid O^c)\, P(O^c)}{P(D \mid O^c)\, P(O^c) + P(D \mid O)\, P(O)}. $$
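A minimal worked example with hypothetical rates, showing how a small outlier proportion drags down the Bayesian detection rate even for a good detector.

p_O    <- 0.01   # P(O): proportion of outliers (hypothetical)
p_D_O  <- 0.99   # P(D|O): detection rate (hypothetical)
p_D_Oc <- 0.02   # P(D|O^c): false alarm probability on normal points (hypothetical)
p_O_given_D <- (p_D_O * p_O) / (p_D_O * p_O + p_D_Oc * (1 - p_O))
p_O_given_D   # roughly 0.33: two thirds of the flagged points are not outliers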