Class #3: Clustering (Moses Lab slides, 2015-02-24)
Class #3: Clustering
ML4Bio 2012 January 27th, 2012
Quaid Morris
Overview
• Objective functions
• Parametric clustering (i.e. we're estimating some parameters):
  – K-means
  – Gaussian mixture models
• Network-based (non-parametric clustering, no "parameters" estimated):
  – Hierarchical clustering
  – Affinity propagation
  – Graph (i.e., network) cuts
  – MCL
Advanced topics: next QM tutorial
Objective functions
An objective function, e.g., E(Θ), measures the fit of one or more parameters, indicated by Θ, to a set of observations.
Estimation is then a matter of maximizing the objective function, i.e. finding the parameter settings that fit the observations best.
Likelihood and log likelihood are examples of common objective functions.
Notes about objective functions
Beware: sometimes you are supposed to minimize an objective function (rather than maximize it); in those cases, the objective function is usually called a cost function or an error function.

Note: you can always turn a minimization problem into a maximization problem by putting a minus sign in front of the cost function!

similarity = −distance
Examples of objective functions: log likelihood
• Estimating the bias of a coin, p, given that weʼve observed m heads and n tails.
Pr(m heads and n tails | bias of p) = $\binom{m+n}{m} p^m (1-p)^n$

Use the log likelihood minus a constant as the objective function:

$E(p) = m \log p + n \log(1-p)$

Maximum likelihood (ML) estimate:

$p_{ML} = \arg\max_p E(p)$
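The ML estimate can be checked numerically; a small Python sketch (the grid search and the counts m = 7, n = 3 are illustrative, not from the slides):

```python
import math

def coin_log_likelihood(p, m, n):
    """E(p) = m*log(p) + n*log(1-p): the log likelihood of m heads
    and n tails, dropping the constant log-binomial term."""
    return m * math.log(p) + n * math.log(1 - p)

# Grid search for p_ML = argmax_p E(p); analytically p_ML = m / (m + n).
m, n = 7, 3
grid = [i / 1000 for i in range(1, 1000)]          # p in (0, 1)
p_ml = max(grid, key=lambda p: coin_log_likelihood(p, m, n))
print(p_ml)  # 0.7, matching m / (m + n)
```

Since E(p) is concave, the grid maximum sits at the grid point closest to the analytic optimum m/(m+n).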
Examples of objective functions: sum of squared errors
• Estimating the mean of a distribution, m, given that weʼve observed samples x1, x2, …, xN.
Objective function:

$E(m) = -\sum_{i=1}^{N} (m - x_i)^2$

Minimum sum of squared error (MSSE) estimate:

$m_{MSSE} = \arg\max_m E(m)$
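The MSSE estimate is just the sample mean; a quick Python sketch confirming it (the data values are made up for illustration):

```python
# Maximizing E(m) = -sum_i (m - x_i)^2 over a fine grid of candidate
# means recovers the sample mean, here mean([1, 2, 6]) = 3.
xs = [1.0, 2.0, 6.0]

def neg_sse(m, xs):
    """E(m) = -sum_i (m - x_i)^2 (the negated sum of squared errors)."""
    return -sum((m - x) ** 2 for x in xs)

grid = [i / 100 for i in range(0, 1001)]            # m in [0, 10]
m_hat = max(grid, key=lambda m: neg_sse(m, xs))
print(m_hat)  # 3.0, the sample mean
```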
An MSSE estimate is an ML estimate!
• The MSSE is the log likelihood (minus a constant) of the data under a Normal distribution with fixed variance:
Recall:

$P(x_1, x_2, \ldots, x_N \mid m) = \prod_{i=1}^{N} P(x_i \mid m)$

and if $x_i$ is normally distributed with mean $m$ and variance 1 (i.e. $\sigma^2 = 1$), then:

$P(x_i \mid m) = \frac{1}{\sqrt{2\pi}} e^{-(x_i - m)^2 / 2}$
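Making the connection explicit: taking the log of the product above gives

$\log P(x_1, \ldots, x_N \mid m) = \sum_{i=1}^{N} \left[ -\tfrac{1}{2}(x_i - m)^2 - \tfrac{1}{2}\log 2\pi \right] = -\tfrac{1}{2}\sum_{i=1}^{N}(x_i - m)^2 - \tfrac{N}{2}\log 2\pi$

which is half the objective $E(m)$ above minus a constant, so the same $m$ maximizes both.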
Sum of squared errors with vectors
• Recall from linear algebra that for vectors v and w (where $v_j$ and $w_j$ are elements of these vectors), their dot, or inner, product is:

$v^T w = \sum_j v_j w_j$

• For measuring the SSE between a vector-valued parameter m and observations x1, x2, etc., we use the squared Euclidean distance:

$\sum_i \|m - x_i\|^2 = \sum_i (m - x_i)^T (m - x_i)$
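In code, the identity above is just a sum over coordinates; a minimal Python sketch (the function name is mine):

```python
def sq_euclidean(v, w):
    """||v - w||^2 = (v - w)^T (v - w) = sum_j (v_j - w_j)^2."""
    return sum((vj - wj) ** 2 for vj, wj in zip(v, w))

print(sq_euclidean([1.0, 2.0], [4.0, 6.0]))  # 3^2 + 4^2 = 25.0
```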
K-means
• Given K, the number of clusters, and a set of data vectors x1, x2, …, xN, find K clusters defined by their means (or sometimes "centroids"): m1, m2, …, mK.
• Each data vector $x_i$ is assigned to its closest mean. Let c(i) be the cluster that $x_i$ is assigned to, so:

$c(i) = \arg\min_j \|x_i - m_j\|^2$
K-means
• Objective function:

$E(m_1, \ldots, m_K) = \sum_i \|x_i - m_{c(i)}\|^2$

Sometimes E is called the distortion.

Recall:

$c(i) = \arg\min_j \|x_i - m_j\|^2$
Lloyd's algorithm for K-means
• The K-means objective function is multimodal and it's not obvious how best to minimize it. There are a number of algorithms for this (see, e.g., the kmeans() help in R).
• However, one algorithm, Lloyd's, is so commonly used that it's often called "the K-means algorithm".
Lloyd's algorithm for K-means
• Step 0: initialize the means
  – (can do this by randomly sampling the means, or by randomly assigning data points to means)
• Step 1: Compute the means based on the cluster assignments ("M-step"):
  – m_j = mean of all x_i such that c(i) = j.
• Step 2: Recompute the cluster assignments c(i) based on the new means.
• Step 3: If the assignments don't change, then you are done; otherwise go back to Step 1.
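The four steps above translate directly into code; a minimal 1-D Python sketch (the data, function name, and seed are illustrative):

```python
import random

def lloyd_kmeans(xs, k, iters=100, seed=0):
    """Lloyd's algorithm for K-means on a list of 1-D points.

    Step 0 initializes the means by sampling k data points; the loop
    then alternates reassignment with the M-step until the
    assignments stop changing.
    """
    rng = random.Random(seed)
    means = rng.sample(xs, k)                         # Step 0
    assign = None
    for _ in range(iters):
        # Step 2: assign each point to its closest mean
        new_assign = [min(range(k), key=lambda j: (x - means[j]) ** 2)
                      for x in xs]
        if new_assign == assign:                      # Step 3: converged
            break
        assign = new_assign
        # Step 1 ("M-step"): each mean becomes the mean of its cluster
        for j in range(k):
            members = [x for x, c in zip(xs, assign) if c == j]
            if members:                               # keep old mean if a cluster empties
                means[j] = sum(members) / len(members)
    return means, assign

# Two well-separated groups on a line.
means, assign = lloyd_kmeans([0.0, 0.2, 0.4, 9.8, 10.0, 10.2], k=2)
```

On this well-separated data, any distinct pair of sampled initial means converges to the two group centroids, 0.2 and 10.0.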
Another K-means algorithm that almost always works better (sequential K-means)
• Step 0: initialize the means
  – (can do this by randomly sampling the means, or by randomly assigning data points to means)
• Step 1: Compute the means based on the cluster assignments ("M-step"):
  – m_j = mean of all x_i such that c(i) = j.
• Step 2: Recompute the cluster assignment for a single randomly chosen point x_i.
• Step 3: If not converged, go back to Step 1.
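A Python sketch of this sequential variant, under the same illustrative setup (data, names, and step count are mine):

```python
import random

def sequential_kmeans(xs, k, steps=2000, seed=0):
    """Sequential K-means on 1-D points: initialize as in Lloyd's, then
    repeatedly reassign one randomly chosen point (Step 2) and update
    the means of its old and new clusters (Step 1)."""
    rng = random.Random(seed)
    means = rng.sample(xs, k)                         # Step 0
    assign = [min(range(k), key=lambda j: (x - means[j]) ** 2) for x in xs]

    def update_mean(j):
        members = [x for x, c in zip(xs, assign) if c == j]
        if members:                                   # keep old mean if a cluster empties
            means[j] = sum(members) / len(members)

    for j in range(k):
        update_mean(j)                                # initial M-step
    for _ in range(steps):
        i = rng.randrange(len(xs))                    # Step 2: one random point
        old = assign[i]
        assign[i] = min(range(k), key=lambda j: (xs[i] - means[j]) ** 2)
        update_mean(old)
        update_mean(assign[i])
    return means, assign

means, assign = sequential_kmeans([0.0, 0.2, 0.4, 9.8, 10.0, 10.2], k=2)
```

Updating the means after every single reassignment (rather than after a full pass) is what distinguishes this from Lloyd's algorithm.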
Clustering example
[Figure: data points in the (x1, x2) plane; Step 0: initial means]
Clustering example
[Figure: Step 1, "M-step": means recomputed from the current assignments]
Clustering example
[Figure: Step 2, "E-step": points reassigned to their closest mean]
Clustering example
[Figure: Step 1, "M-step"]
Clustering example
[Figure: Step 2, "E-step"]
Clustering example
[Figure: Step 1, "M-step"]
Clustering example
[Figure: Step 2, "E-step": assignments unchanged, we're done!]
Gaussian mixture models
• We're not capturing any information about the "spread" of the data when we do K-means; this can be especially important when choosing the number of means.
• Rarely can you eyeball the data, as we just did, to choose the number of means. So we'd like to know whether we have one cluster with a broad spread in one dimension and a narrow spread in another, or multiple clusters.
• This narrow/broad spread can happen if your dimensions are measured in different units.
Gaussian mixture model
[Figure: what we want: two clusters in the (x1, x2) plane with covariances Σ1 and Σ2]
Gaussian mixture model
• Objective function:

$E(m_1, \ldots, m_K, \Sigma_1, \ldots, \Sigma_K) = \prod_i P(x_i \mid m_1, \ldots, m_K, \Sigma_1, \ldots, \Sigma_K)$

where

$P(x_i \mid m_1, \ldots, m_K, \Sigma_1, \ldots, \Sigma_K) = \sum_j \pi_j N(x_i \mid m_j, \Sigma_j)$

and $N(x_i \mid m_j, \Sigma_j)$ is the multivariate normal density.
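A 1-D Python sketch of this objective (the slides use multivariate normals; the scalar variances, made-up parameters, and function names here are mine):

```python
import math

def normal_pdf(x, mean, var):
    """Univariate N(x | m, var) density."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_log_likelihood(xs, weights, means, variances):
    """log E = sum_i log sum_j pi_j N(x_i | m_j, var_j)."""
    return sum(math.log(sum(pi * normal_pdf(x, m, v)
                            for pi, m, v in zip(weights, means, variances)))
               for x in xs)

# Two components with mixing weights pi_j summing to 1.
ll = gmm_log_likelihood([0.0, 0.1, 5.0], weights=[0.5, 0.5],
                        means=[0.0, 5.0], variances=[1.0, 1.0])
```

Working in log space avoids underflow when the product over many data points gets tiny.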
What's Σ???
• Σ is the covariance matrix; it specifies the shape of the distribution.
• If Σ = σ²I (σ² times the identity matrix), then the distribution is circular, with variance σ² in every direction.
• If Σ is a diagonal matrix, then the distribution is "axis-aligned" but may be elliptical.
• If Σ is neither, then the ellipse is slanted.
Gaussian mixture model
[Figure: Σ1 is the identity matrix (circular cluster)]
Gaussian mixture model
[Figure: Σ1 is diagonal (axis-aligned ellipse)]
Gaussian mixture model
[Figure: Σ1 is not diagonal (slanted ellipse)]
Hierarchical agglomerative clustering
It is often difficult to determine the correct number of clusters ahead of time, and we may want to group observations at different levels of resolution.
[Figure: dendrogram and clustergram; Eisen et al., PNAS 1998]
Hierarchical agglomerative clustering
• Start with a set of clusters C = {{x1}, {x2}, …, {xN}}, each containing exactly one of the observations; also assume a distance metric d_ij is defined for each pair x_i, x_j.
• While not done:
  – Find the most similar (i.e. least distant) pair of clusters in C, say Ca and Cb.
  – Merge Ca and Cb to make a new cluster Cnew; remove Ca and Cb from C and add Cnew.
  – Done when C contains only one cluster.
Hierarchical agglomerative clustering
Algorithms vary in how they calculate the distance between clusters, d(Ca, Cb). In all cases, if both clusters contain only one item, say Ca = {x_i} and Cb = {x_j}, then d(Ca, Cb) = d_ij.
Hierarchical agglomerative clustering
If the clusters have more than one item, then you have to choose a linkage criterion:

Average linkage (UPGMA): d(Ca, Cb) = mean of the distances between items in the two clusters

Single linkage: d(Ca, Cb) = minimum distance between items in the two clusters

Complete linkage: d(Ca, Cb) = maximum distance between items in the two clusters
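The merge loop from the previous slide plus these linkage rules fit in a short Python sketch (the distance matrix and names are illustrative):

```python
def agglomerate(dist, linkage="single"):
    """Hierarchical agglomerative clustering from a distance matrix.

    Start with singleton clusters; repeatedly merge the least distant
    pair, where d(Ca, Cb) aggregates the pairwise item distances by
    the chosen linkage: 'single' (min), 'complete' (max), or
    'average' (UPGMA). Returns the list of merges with their heights.
    """
    agg = {"single": min, "complete": max,
           "average": lambda ds: sum(ds) / len(ds)}[linkage]

    def d(ca, cb):
        return agg([dist[i][j] for i in ca for j in cb])

    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # find the most similar (i.e. least distant) pair of clusters
        a, b = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: d(clusters[ab[0]], clusters[ab[1]]))
        merges.append((clusters[a], clusters[b], d(clusters[a], clusters[b])))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Three points on a line at positions 0, 1, 10; dist[i][j] = |pos_i - pos_j|.
dist = [[0, 1, 10], [1, 0, 9], [10, 9, 0]]
```

With this matrix, points 0 and 1 merge first at height 1; the final merge height is 9 (single), 10 (complete), or 9.5 (average), showing how the linkage choice changes the dendrogram.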
Drawing the dendrogram
[Figure: subtrees Ca and Cb joined at height d(Ca, Cb); the joining node represents Cnew]
Rotation of the subtrees in the dendrogram is arbitrary.
[Figure: dendrogram and clustergram; Eisen et al., PNAS 1998]
Advanced topic: ordering the items
• When displaying the clustergram, the ordering of the items needs to be consistent with the dendrogram (and the clustering), but there are many consistent orderings, since you can arbitrarily rotate the subtrees.
• The "TreeArrange" algorithm (Bar-Joseph et al., 2001) finds the ordering of the items that minimizes the distance between adjacent items while remaining consistent with the dendrogram (and clustering).
Affinity propagation (Frey and Dueck, Science 2007)
An exemplar-based clustering method, i.e. each cluster centre is one of the data points. It also "automatically" chooses the number of cluster centres to use. Requires the similarities s_ij for each pair of data items x_i and x_j.
Affinity propagation
• Objective function (for similarities s_ij):

$E(c) = \sum_i s_{i,c(i)}$

where c(i) is the index of the centre that data item x_i is assigned to.
• The self-similarities s_ii determine the number of centres: e.g., if s_ii is less than s_ij for all j ≠ i, then there will be only one centre; if s_ii is greater than s_ij for all j, every point will be its own centre.
Propagation algorithm for computing c(i)
• Updates two sets of quantities:
  – r_ik, the responsibility of k for i
  – a_ik, the availability of k to serve as i's centre

$r_{ik} = s_{ik} - \max_{k' \neq k} (a_{ik'} + s_{ik'})$

$a_{ik} = \min\left\{0,\; r_{kk} + \sum_{i' \notin \{i,k\}} \max(0, r_{i'k})\right\}$ for $i \neq k$

$a_{kk} = \sum_{i' \neq k} \max(0, r_{i'k})$

• All a_ik are initialized to 0.
• Once converged, $c(i) = \arg\max_k (r_{ik} + a_{ik})$.
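The update equations above can be sketched directly in Python; the damping factor, toy data, and self-similarity value below are my additions (damping is standard practice for stability but is not on the slide):

```python
def affinity_propagation(s, max_iter=200, damping=0.5):
    """Affinity propagation on a full similarity matrix s.

    Implements the responsibility/availability updates from the slide,
    damped for stability, and returns c(i) = argmax_k (r_ik + a_ik).
    """
    n = len(s)
    r = [[0.0] * n for _ in range(n)]  # responsibilities r_ik
    a = [[0.0] * n for _ in range(n)]  # availabilities a_ik (start at 0)
    for _ in range(max_iter):
        # r_ik = s_ik - max_{k' != k} (a_ik' + s_ik')
        for i in range(n):
            for k in range(n):
                best = max(a[i][kp] + s[i][kp] for kp in range(n) if kp != k)
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - best)
        for k in range(n):
            pos = [max(0.0, r[ip][k]) for ip in range(n)]
            total = sum(pos)
            for i in range(n):
                if i == k:   # a_kk = sum_{i' != k} max(0, r_i'k)
                    new = total - pos[k]
                else:        # a_ik = min(0, r_kk + sum_{i' not in {i,k}} max(0, r_i'k))
                    new = min(0.0, r[k][k] + total - pos[i] - pos[k])
                a[i][k] = damping * a[i][k] + (1 - damping) * new
    return [max(range(n), key=lambda k: r[i][k] + a[i][k]) for i in range(n)]

# Toy example: two well-separated groups on a line.
xs = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
s = [[-(xi - xj) ** 2 for xj in xs] for xi in xs]
for i in range(len(xs)):
    s[i][i] = -1.0  # self-similarity ("preference"); more negative means fewer centres
c = affinity_propagation(s)
```

With these similarities the method should settle on one exemplar per group, assigning each point to the centre of its own group.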