
Page 1: k-means clustering

Hongning Wang, CS@UVa

Page 2: Today's lecture

• k-means clustering
  – A typical partitional clustering algorithm
  – Convergence property
• Expectation Maximization algorithm
  – Gaussian mixture model

Page 3: Partitional clustering algorithms

• Partition instances into exactly k non-overlapping clusters
  – Flat (non-hierarchical) clustering structure
  – Users need to specify the number of clusters k
  – Task: identify the partition into k clusters that optimizes the chosen partition criterion

Page 4: Partitional clustering algorithms

• Partition instances into exactly k non-overlapping clusters
  – Typical criterion: small intra-cluster distance and large inter-cluster distance
  – Optimal solution: enumerate every possible partition into k clusters and return the one optimizing the criterion

Unfortunately, this is NP-hard! Let's approximate it: optimize the criterion in an alternative, iterative way.

Page 5: k-means algorithm

Input: number of clusters k, instances $\{x_i\}_{i=1}^N$, distance metric $d(\cdot,\cdot)$
Output: cluster membership assignments $\{z_i\}_{i=1}^N$
1. Initialize k cluster centroids (randomly, if no domain knowledge is available)
2. Repeat until no instance changes its cluster membership:
   – Decide the cluster membership of each instance by assigning it to the nearest cluster centroid (this minimizes intra-cluster distance)
   – Update the k cluster centroids based on the assigned cluster membership (this maximizes inter-cluster distance)
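For reference, a minimal NumPy sketch of the algorithm above (my own illustrative code, not the course's; it assumes dense vectors, Euclidean distance, and random initialization from the data):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means. X: (N, d) array. Returns (assignments, centroids)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # 1. Initialize k centroids by sampling k distinct instances.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    assignments = None
    for _ in range(max_iter):
        # 2a. Assign every instance to its nearest centroid (minimize intra-cluster distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (N, k)
        new_assignments = dists.argmin(axis=1)
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break  # stop: no instance changed its cluster membership
        assignments = new_assignments
        # 2b. Update each centroid to the mean of its assigned instances.
        for j in range(k):
            members = X[assignments == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return assignments, centroids

For example, kmeans(np.random.randn(300, 2), k=3) clusters a toy 2-d point cloud.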

Pages 6-10: k-means illustration

(Figures only: the slides step through successive assignment and centroid-update iterations; page 7 shows the Voronoi diagram induced by the current centroids.)

Page 11: Complexity analysis

• Decide cluster membership: $O(kN)$ distance computations per iteration
• Compute the cluster centroids: $O(N)$ vector additions per iteration
• Assume k-means stops after $\ell$ iterations: $O(\ell k N)$ distance computations in total

Don't forget the complexity of each distance computation, e.g., $O(V)$ for Euclidean distance over $V$-dimensional vectors.
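Putting these pieces together (a sketch assuming dense $V$-dimensional vectors, so each Euclidean distance costs $O(V)$):

$\underbrace{O(kNV)}_{\text{membership: } k \text{ distances per instance}} \;+\; \underbrace{O(NV)}_{\text{centroid update: one pass over the data}} \;\text{per iteration} \;\Rightarrow\; O(\ell\, k N V) \text{ total for } \ell \text{ iterations.}$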

Page 12: Convergence property

• Why does k-means stop?
  – Answer: it is a special case of the Expectation Maximization (EM) algorithm, and EM is guaranteed to converge
  – However, it is only guaranteed to converge to a local optimum, since k-means (EM) is a greedy algorithm

Page 13: Probabilistic interpretation of clustering

• The density model of $p(x)$ is multi-modal
• Each mode represents a sub-population
  – E.g., a unimodal Gaussian for each group

Mixture model: $p(x) = \sum_z p(x|z)\,p(z)$, where each $p(x|z)$ is a unimodal distribution and $p(z)$ gives the mixing proportions.

(Figure: three component densities $p(x|z=1)$, $p(x|z=2)$, $p(x|z=3)$.)

Page 14: Probabilistic interpretation of clustering

• If $z$ is known for every $x$
  – Estimating $p(z)$ and $p(x|z)$ is easy
    • Maximum likelihood estimation
    • This is exactly what Naïve Bayes does

Mixture model: $p(x) = \sum_z p(x|z)\,p(z)$, where each $p(x|z)$ is a unimodal distribution and $p(z)$ gives the mixing proportions.

(Figure: three component densities $p(x|z=1)$, $p(x|z=2)$, $p(x|z=3)$.)
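A minimal sketch of this "easy" case (my own illustrative code and names, assuming Gaussian components): with the labels observed, the maximum likelihood estimates are just per-cluster frequencies, means, and covariances, exactly as in Naïve Bayes.

import numpy as np

def mle_with_known_z(X, z, k):
    """MLE for a Gaussian mixture when the component label z of every x is observed."""
    X = np.asarray(X, dtype=float)
    z = np.asarray(z)
    prior = np.array([(z == j).mean() for j in range(k)])                 # p(z=j): mixing proportions
    means = np.array([X[z == j].mean(axis=0) for j in range(k)])          # mean of p(x|z=j)
    covs = np.array([np.cov(X[z == j].T, bias=True) for j in range(k)])   # covariance of p(x|z=j)
    return prior, means, covs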

Page 15: Probabilistic interpretation of clustering

• But $z$ is unknown for every $x$
  – Estimating $p(z)$ and $p(x|z)$ is generally hard, usually a constrained optimization problem
  – Appeal to the Expectation Maximization algorithm

Mixture model: $p(x) = \sum_z p(x|z)\,p(z)$, where each $p(x|z)$ is a unimodal distribution and $p(z)$ gives the mixing proportions.

(Figure: the component densities $p(x|z=1)$, $p(x|z=2)$, $p(x|z=3)$ are marked with question marks, since they are unknown.)

Page 16: Recap: partitional clustering algorithms

• Partition instances into exactly k non-overlapping clusters
  – Typical criterion: small intra-cluster distance and large inter-cluster distance
  – Optimal solution: enumerate every possible partition into k clusters and return the one optimizing the criterion

Unfortunately, this is NP-hard! Let's approximate it: optimize the criterion in an alternative, iterative way.

Page 17: Recap: k-means algorithm

Input: number of clusters k, instances $\{x_i\}_{i=1}^N$, distance metric $d(\cdot,\cdot)$
Output: cluster membership assignments $\{z_i\}_{i=1}^N$
1. Initialize k cluster centroids (randomly, if no domain knowledge is available)
2. Repeat until no instance changes its cluster membership:
   – Decide the cluster membership of each instance by assigning it to the nearest cluster centroid (this minimizes intra-cluster distance)
   – Update the k cluster centroids based on the assigned cluster membership (this maximizes inter-cluster distance)

Page 18: Recap: probabilistic interpretation of clustering

• The density model of $p(x)$ is multi-modal
• Each mode represents a sub-population
  – E.g., a unimodal Gaussian for each group

Mixture model: $p(x) = \sum_z p(x|z)\,p(z)$, where each $p(x|z)$ is a unimodal distribution and $p(z)$ gives the mixing proportions.

(Figure: three component densities $p(x|z=1)$, $p(x|z=2)$, $p(x|z=3)$.)

Page 19: Introduction to EM

• Parameter estimation
  – All data is observable
    • Maximum likelihood estimator
    • Optimize the analytic form of $\log p(X|\theta)$
  – Missing/unobservable data, e.g., cluster membership
    • Data: X (observed) + Z (hidden)
    • Likelihood: $\log p(X|\theta) = \log \sum_Z p(X,Z|\theta)$, which is intractable to optimize directly in most cases
    • Approximate it!

Page 20: Background knowledge

• Jensen's inequality
  – For any convex function $f$ and positive weights $\lambda_i$ with $\sum_i \lambda_i = 1$:

    $f\Big(\sum_i \lambda_i x_i\Big) \le \sum_i \lambda_i f(x_i)$
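A quick numerical check (illustrative numbers of my own choosing) with the convex function f(x) = x^2:

import numpy as np

f = np.square                       # a convex function
x = np.array([1.0, 3.0, 6.0])       # arbitrary points
lam = np.array([0.2, 0.5, 0.3])     # positive weights summing to 1

lhs = f(np.dot(lam, x))             # f(sum_i lam_i * x_i) = 3.5**2 = 12.25
rhs = np.dot(lam, f(x))             # sum_i lam_i * f(x_i) = 15.5
print(lhs <= rhs)                   # True, as Jensen's inequality predicts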

Page 21: Expectation Maximization

• Maximize the data likelihood by pushing up a lower bound:

  $\log p(X|\theta) = \log \sum_Z q(Z)\,\frac{p(X,Z|\theta)}{q(Z)} \;\ge\; \sum_Z q(Z)\log p(X,Z|\theta) - \sum_Z q(Z)\log q(Z)$

  by Jensen's inequality, $f(E[x]) \ge E[f(x)]$ for the concave function $f = \log$.

  – The lower bound is easier to compute and has many good properties!
  – Components we need to tune when optimizing the bound: $q(Z)$ and $\theta$, where $q(Z)$ is a proposal distribution over the hidden variable $Z$.

Page 22: Intuitive understanding of EM

(Figure: the data likelihood $p(X|\theta)$ as a function of $\theta$, together with a lower bound that is easier to optimize and is guaranteed to improve the data likelihood.)

Page 23: Expectation Maximization (cont.)

• Optimize the lower bound w.r.t. $q(Z)$:

  $\sum_Z q(Z)\log p(X,Z|\theta) - \sum_Z q(Z)\log q(Z)$
  $= \sum_Z q(Z)\,[\log p(Z|X,\theta) + \log p(X|\theta)] - \sum_Z q(Z)\log q(Z)$
  $= \sum_Z q(Z)\log \frac{p(Z|X,\theta)}{q(Z)} + \log p(X|\theta)$
  $= -KL\big(q(Z)\,\|\,p(Z|X,\theta)\big) + \log p(X|\theta)$

  The first term is the negative KL-divergence between $q(Z)$ and $p(Z|X,\theta)$; the second term is constant with respect to $q(Z)$.

Page 24: Expectation Maximization (cont.)

• Optimize the lower bound w.r.t. $q(Z)$
  – The KL-divergence is non-negative, and equals zero iff $q(Z) = p(Z|X,\theta)$
  – A step further: when $q(Z) = p(Z|X,\theta)$, the lower bound equals $\log p(X|\theta)$, i.e., the bound is tight!
  – Other choices of $q(Z)$ cannot achieve this tight bound, but might reduce computational complexity
  – Note: the calculation of $q(Z)$ is based on the current $\theta^t$

Page 25: Expectation Maximization (cont.)

• Optimize the lower bound w.r.t. $q(Z)$
  – Optimal solution: $q(Z) = p(Z|X,\theta^t)$, the posterior distribution of $Z$ given the current model $\theta^t$
  – In k-means, this corresponds to assigning each instance to its closest cluster centroid: $z_i = \arg\min_k d(c_k, x_i)$

Page 26: Expectation Maximization (cont.)

• Optimize the lower bound w.r.t. $\theta$
  – With $q(Z) = p(Z|X,\theta^t)$ fixed, the remaining term of the lower bound is constant w.r.t. $\theta$, so

    $\theta^{t+1} = \arg\max_\theta E_{Z|X,\theta^t}\big[\log p(X,Z|\theta)\big]$

    i.e., we maximize the expectation of the complete-data likelihood.
  – In k-means we do not compute this expectation; we use the most probable configuration of $Z$ instead, and then update each centroid to the mean of the instances assigned to it.

Page 27: Expectation Maximization

• EM tries to iteratively maximize the likelihood
  – "Complete-data" likelihood: $L_c(\theta) = \log p(X,Z|\theta)$
  – Starting from an initial guess $\theta^{(0)}$:
    1. E-step: compute the expectation of the complete-data likelihood

       $Q(\theta;\theta^t) = E_{Z|X,\theta^t}[L_c(\theta)] = \sum_Z p(Z|X,\theta^t)\,\log p(X,Z|\theta)$

    2. M-step: compute $\theta^{(t+1)}$ by maximizing the Q-function (the key step!)

       $\theta^{t+1} = \arg\max_\theta Q(\theta;\theta^t)$
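To make the two steps concrete, here is a minimal EM sketch for a mixture of spherical Gaussians (my own illustrative code, not the course's; it fixes a shared variance so that the M-step has a simple closed form):

import numpy as np

def em_gmm(X, k, n_iter=50, var=1.0, seed=0):
    """EM for a mixture of spherical Gaussians with fixed, shared variance."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)].copy()   # theta^(0): initial means
    pi = np.full(k, 1.0 / k)                                   # initial mixing proportions p(z)
    for _ in range(n_iter):
        # E-step: responsibilities p(z | x, theta^t) for every instance (soft assignment).
        sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)    # (N, k)
        log_r = np.log(pi)[None, :] - sq_dist / (2.0 * var)
        log_r -= log_r.max(axis=1, keepdims=True)                        # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: theta^(t+1) = argmax_theta Q(theta; theta^t), closed form under this model.
        Nk = r.sum(axis=0)
        mu = (r.T @ X) / Nk[:, None]
        pi = Nk / len(X)
    return pi, mu, r

Replacing the soft responsibilities r with a hard argmax recovers exactly the k-means updates.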

Page 28: Intuitive understanding of EM

(Figure: the data likelihood $p(X|\theta)$ and its lower bound (the Q function), with the current guess $\theta^t$ and the next guess $\theta^{t+1}$.)

E-step = computing the lower bound; M-step = maximizing the lower bound.

In k-means:
• E-step: identify the cluster memberships, $z_i = \arg\min_k d(c_k, x_i)$
• M-step: update each centroid $c_k$ to the mean of the instances assigned to it

Page 29: Convergence guarantee

• Proof of EM's convergence

  $\log p(X|\theta) = \log p(Z,X|\theta) - \log p(Z|X,\theta)$

  Taking the expectation with respect to $p(Z|X,\theta^t)$ on both sides:

  $\log p(X|\theta) = \sum_Z p(Z|X,\theta^t)\log p(Z,X|\theta) - \sum_Z p(Z|X,\theta^t)\log p(Z|X,\theta) = Q(\theta;\theta^t) + H(\theta;\theta^t)$

  where $H(\theta;\theta^t) = -\sum_Z p(Z|X,\theta^t)\log p(Z|X,\theta)$ is a cross-entropy. Then the change of the log data likelihood between EM iterations is:

  $\log p(X|\theta) - \log p(X|\theta^t) = Q(\theta;\theta^t) + H(\theta;\theta^t) - Q(\theta^t;\theta^t) - H(\theta^t;\theta^t)$

  By Jensen's inequality, $H(\theta;\theta^t) \ge H(\theta^t;\theta^t)$, which means

  $\log p(X|\theta) - \log p(X|\theta^t) \ge Q(\theta;\theta^t) - Q(\theta^t;\theta^t) \ge 0$

  where the last inequality is guaranteed by the M-step.

Page 30: What is not guaranteed

• The global optimum is not guaranteed!
  – The likelihood $\log p(X|\theta)$ is non-convex in most cases
  – EM boils down to a greedy algorithm
    • Alternating ascent
• Generalized EM
  – E-step: as before, set $q(Z) = p(Z|X,\theta^t)$
  – M-step: choose any $\theta^{t+1}$ that improves $Q(\theta;\theta^t)$, not necessarily the maximizer

Page 31: k-means vs. Gaussian Mixture

• If we use Euclidean distance in k-means
  – We have implicitly assumed $p(x|z)$ is Gaussian
  – Gaussian Mixture Model (GMM):

    $p(x|z) = \frac{1}{\sqrt{(2\pi)^k |\Sigma_z|}} \exp\!\left(-\tfrac{1}{2}(x-\mu_z)^T \Sigma_z^{-1} (x-\mu_z)\right)$, with a multinomial $p(z)$ as the mixing proportions

  – In k-means we assume equal variance across clusters, so we do not need to estimate it:

    $p(x|z) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\tfrac{1}{2}(x-\mu_z)^T (x-\mu_z)\right)$

  – We also do not consider cluster size, i.e., $p(z)$, in k-means

Page 32: k-means vs. Gaussian Mixture

• Soft (GMM) vs. hard (k-means) posterior assignment

(Figure: side-by-side comparison of GMM's soft assignments and k-means's hard assignments.)
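In code the difference is one line: k-means keeps only the argmax, while GMM keeps the full posterior (a sketch reusing the spherical-Gaussian setting and my own names; equal mixing proportions are assumed for brevity):

import numpy as np

def hard_and_soft_assignments(X, mu, var=1.0):
    """Hard (k-means style) vs. soft (GMM style) posterior assignment to centers mu."""
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, k)
    hard = sq_dist.argmin(axis=1)                     # k-means: a single cluster per instance
    resp = np.exp(-sq_dist / (2.0 * var))             # unnormalized Gaussian likelihoods
    soft = resp / resp.sum(axis=1, keepdims=True)     # GMM: p(z | x), sums to 1 per instance
    return hard, soft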

Page 33: k-means in practice

• Extremely fast and scalable
  – One of the most widely used clustering methods
    • Among the top 10 data mining algorithms (ICDM 2006)
  – Can be easily parallelized
    • Map-Reduce implementation (see the sketch below)
      – Mapper: assign each instance to its closest centroid
      – Reducer: update each centroid based on the assigned cluster membership
  – Sensitive to initialization
    • Prone to local optima
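A minimal sketch of that mapper/reducer split (plain Python standing in for an actual Map-Reduce framework; the function names and signatures are illustrative, not a real API):

import numpy as np

def mapper(x, centroids):
    """Map: emit (id of the closest centroid, (instance, count=1))."""
    j = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
    return j, (x, 1)

def reducer(j, values):
    """Reduce: new centroid j = sum of the assigned instances / their count."""
    total = sum(x for x, _ in values)
    count = sum(c for _, c in values)
    return j, total / count

One map-reduce round corresponds to one k-means iteration; a driver repeats rounds until assignments stop changing.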

Page 34: Better initialization: k-means++

1. Choose the first cluster center uniformly at random from the instances
2. Repeat until all k centers have been chosen:
   – For each instance x, compute D(x), its distance to the nearest already-chosen center
   – Choose a new cluster center with probability proportional to $D(x)^2$
3. Run k-means with the selected centers as the initialization (sketched below)

The intuition: a new center should be far away from the existing centers.
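A minimal sketch of this seeding procedure (my own illustrative code, assuming Euclidean distance; in practice it would feed the kmeans() routine sketched on page 5):

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: sample each new center with probability proportional to D(x)^2."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                    # 1. first center uniformly at random
    while len(centers) < k:                                # 2. repeat until k centers are chosen
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)   # D(x)^2
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])               # far points more likely
    return np.array(centers)                               # 3. run k-means from these centers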

Page 35: How to determine k

• Vary k to optimize the clustering criterion
  – Internal vs. external validation
  – Cross validation
    • Look for an abrupt change in the objective function

Page 36: How to determine k

• Vary k to optimize the clustering criterion
  – Internal vs. external validation
  – Cross validation
    • Look for an abrupt change in the objective function
    • Model selection criteria that penalize too many clusters
      – AIC, BIC (a sketch follows)
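One hedged way to operationalize this (my own illustrative code; the BIC-style penalty shown is one common heuristic for k-means, not a formula from the slides): sweep k, record the within-cluster sum of squares, and look for the abrupt change, or pick the k with the best penalized score.

import numpy as np

def sweep_k(X, k_values):
    """Return (k, objective, BIC-style score) per candidate k, using the earlier kmeans()."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    results = []
    for k in k_values:
        labels, cents = kmeans(X, k)                     # from the sketch on page 5
        sse = ((X - cents[labels]) ** 2).sum()           # within-cluster sum of squares
        bic = n * np.log(sse / n) + k * d * np.log(n)    # penalizes too many clusters
        results.append((k, sse, bic))
    return results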

Page 37: What you should know

• k-means algorithm
  – A greedy algorithm that alternates between assignment and centroid-update steps
  – Convergence guarantee
• EM algorithm
• Hard clustering vs. soft clustering
  – k-means vs. GMM

Page 38: Today's reading

• Introduction to Information Retrieval
  – Chapter 16: Flat clustering
    • 16.4 k-means
    • 16.5 Model-based clustering