Unsupervised Learning and Clustering

Source: srihari/CSE555/Chap10.Part1.pdf

Page 1:

Unsupervised Learning and Clustering

Page 2:

Why consider unlabeled samples?

1. Collecting and labeling a large set of samples is costly. Getting recorded speech is free; labeling it is time consuming.

2. A classifier could be designed on a small set of labeled samples and tuned on a large unlabeled set.

3. Train on a large unlabeled set and use supervision on the groupings found.

4. Characteristics of patterns may change with time.

5. Unsupervised methods can be used to find useful features.

6. Exploratory data analysis may discover the presence of significant subclasses affecting design.

Page 3:

Mixture Densities and Identifiability

• Samples come from c classes
• The priors $P(\omega_i)$ are known
• The forms of the class-conditional densities are known
• The values of their parameters are unknown
• The probability density function of the samples is:

$$p(\mathbf{x} \mid \boldsymbol{\theta}) = \sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, \boldsymbol{\theta}_j)\, P(\omega_j)$$
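As a concrete illustration (not from the slides), here is a minimal Python sketch that evaluates such a mixture density for Gaussian class-conditional densities; the two-component parameters below are made-up values:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, priors, means, covs):
    """Evaluate p(x | theta) = sum_j p(x | omega_j, theta_j) P(omega_j)
    for Gaussian class-conditional densities."""
    return sum(P_j * multivariate_normal.pdf(x, mean=mu_j, cov=S_j)
               for P_j, mu_j, S_j in zip(priors, means, covs))

# Hypothetical two-component example (c = 2, d = 2)
priors = [0.6, 0.4]                       # known P(omega_j)
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
print(mixture_density(np.array([1.0, 1.0]), priors, means, covs))
```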

Page 4:

Gradient Ascent for Mixtures

• Mixture density:

$$p(\mathbf{x} \mid \boldsymbol{\theta}) = \sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, \boldsymbol{\theta}_j)\, P(\omega_j)$$

• Likelihood of the observed samples:

$$p(D \mid \boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\theta})$$

• Log-likelihood:

$$l = \sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \boldsymbol{\theta})$$

• Gradient w.r.t. $\boldsymbol{\theta}_i$:

$$\nabla_{\boldsymbol{\theta}_i} l = \sum_{k=1}^{n} \frac{1}{p(\mathbf{x}_k \mid \boldsymbol{\theta})}\, \nabla_{\boldsymbol{\theta}_i}\!\left[\sum_{j=1}^{c} p(\mathbf{x}_k \mid \omega_j, \boldsymbol{\theta}_j)\, P(\omega_j)\right]$$

• The MLE $\hat{\boldsymbol{\theta}}$ must satisfy:

$$\sum_{k=1}^{n} P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}})\, \nabla_{\boldsymbol{\theta}_i} \ln p(\mathbf{x}_k \mid \omega_i, \hat{\boldsymbol{\theta}}_i) = 0$$
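The step between the last two equations uses Bayes' rule, a standard identity worth making explicit here:

$$P(\omega_i \mid \mathbf{x}_k, \boldsymbol{\theta}) = \frac{p(\mathbf{x}_k \mid \omega_i, \boldsymbol{\theta}_i)\, P(\omega_i)}{p(\mathbf{x}_k \mid \boldsymbol{\theta})}$$

so that, assuming the $\boldsymbol{\theta}_j$ are functionally independent, the gradient becomes $\nabla_{\boldsymbol{\theta}_i} l = \sum_{k=1}^{n} P(\omega_i \mid \mathbf{x}_k, \boldsymbol{\theta})\, \nabla_{\boldsymbol{\theta}_i} \ln p(\mathbf{x}_k \mid \omega_i, \boldsymbol{\theta}_i)$, and setting it to zero gives the condition above.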

Page 5:

Gaussian Mixture

• With unknown mean vectors, the MLE condition yields

$$\hat{\boldsymbol{\mu}}_i = \frac{\sum_{k=1}^{n} P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\mu}})\,\mathbf{x}_k}{\sum_{k=1}^{n} P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\mu}})}, \quad \text{where } \hat{\boldsymbol{\mu}} = (\hat{\boldsymbol{\mu}}_1, \ldots, \hat{\boldsymbol{\mu}}_c)^t$$

• This leads to an iterative scheme for improving the estimates:

$$\hat{\boldsymbol{\mu}}_i(j+1) = \frac{\sum_{k=1}^{n} P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\mu}}(j))\,\mathbf{x}_k}{\sum_{k=1}^{n} P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\mu}}(j))}$$
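A minimal sketch of one update of this scheme, assuming Gaussian components with known priors and a known shared covariance (function and variable names are my own):

```python
import numpy as np
from scipy.stats import multivariate_normal

def update_means(X, mus, priors, cov):
    """One pass of the iterative scheme: recompute each mean as the
    posterior-weighted average of the samples."""
    # P(omega_i | x_k, mu_hat) via Bayes' rule; shape (n, c)
    likes = np.stack([multivariate_normal.pdf(X, mean=m, cov=cov)
                      for m in mus], axis=1)
    post = likes * np.asarray(priors)
    post /= post.sum(axis=1, keepdims=True)
    # mu_i(j+1) = sum_k P(omega_i|x_k) x_k / sum_k P(omega_i|x_k)
    return (post.T @ X) / post.sum(axis=0)[:, None]
```

Iterating `update_means` until the means stop changing implements the scheme above.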

Page 6:

k-means clustering

• The Gaussian case with all parameters unknown leads to the formulation:

begin initialize n, c, μ1, μ2, ..., μc
    do classify n samples according to nearest μi
       recompute μi
    until no change in μi
end
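A minimal Python sketch of this procedure (names are my own; initialization is random, and it assumes no cluster goes empty):

```python
import numpy as np

def k_means(X, c, n_iter=100, seed=0):
    """Minimal k-means: alternate nearest-mean classification
    and mean recomputation until the means stop changing."""
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(len(X), size=c, replace=False)]   # initialize
    for _ in range(n_iter):
        # classify samples according to the nearest mu_i
        labels = np.argmin(
            np.linalg.norm(X[:, None, :] - mus[None, :, :], axis=2), axis=1)
        new_mus = np.array([X[labels == i].mean(axis=0) for i in range(c)])
        if np.allclose(new_mus, mus):                    # until no change
            break
        mus = new_mus
    return mus, labels
```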

Page 7:

k-means clustering with one feature

One-dimensional example

Six of the starting points lead to local maxima, whereas the two for which µ1(0) = µ2(0) lead to a saddle point.

Page 8:

k-means clustering with two features

Two-dimensional example

There are three means and three steps in the iteration. Voronoi tessellations based on the means are shown.

Page 9:

Data Description and Clustering

Page 10:

Data Description

• Learning the structure of multidimensional patterns from a set of unlabeled samples

• The samples form clouds of points in d-dimensional space

• If the data came from a single normal distribution, the mean and covariance matrix would suffice as a description

Page 11:

Data sets having identical statistics up to second order, i.e., the same $\boldsymbol{\Sigma}$ and $\boldsymbol{\mu}$

Page 12:

Mixture of c normal distributions approach

• Estimating the parameters is non-trivial
• Assuming particular parametric forms can lead to poor or meaningless results
• Alternatively, use a nonparametric approach: peaks or modes of the density can indicate clusters
• If the goal is to find sub-classes, use clustering procedures

Page 13:

Similarity Measures

• Two issues:
  1. How to measure similarity between samples?
  2. How to evaluate a partitioning?
• If distance is a good measure of dissimilarity, the distance between samples in the same cluster must be smaller than the distance between samples in different clusters
• Two samples belong to the same cluster if the distance between them is less than a threshold $d_0$
• The distance threshold affects the number and size of the clusters

Page 14:

Similarity Measures for Clustering

• Minkowski metric:

$$d(\mathbf{x}, \mathbf{x}') = \left( \sum_{k=1}^{d} |x_k - x'_k|^q \right)^{1/q}$$

q = 2 gives the Euclidean metric; q = 1 gives the Manhattan or city-block metric.

• Metric based on the data itself: the Mahalanobis distance

• Angle between vectors as a similarity measure:

$$s(\mathbf{x}, \mathbf{x}') = \frac{\mathbf{x}^t \mathbf{x}'}{\|\mathbf{x}\|\,\|\mathbf{x}'\|}$$

The cosine of the angle between vectors is invariant to rotation and dilation, but not to translation or general linear transformations.
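Both measures in a short sketch (my own illustration):

```python
import numpy as np

def minkowski(x, xp, q=2):
    """Minkowski metric; q=2 is Euclidean, q=1 is city block."""
    return np.sum(np.abs(x - xp) ** q) ** (1.0 / q)

def cosine_similarity(x, xp):
    """Cosine of the angle between x and x'."""
    return x @ xp / (np.linalg.norm(x) * np.linalg.norm(xp))
```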

Page 15:

Binary Feature Similarity Measures

$$s(\mathbf{x}, \mathbf{x}') = \frac{\mathbf{x}^t \mathbf{x}'}{\|\mathbf{x}\|\,\|\mathbf{x}'\|}$$

The numerator is the number of attributes possessed by both $\mathbf{x}$ and $\mathbf{x}'$; the denominator $(\mathbf{x}^t\mathbf{x}\;\mathbf{x}'^t\mathbf{x}')^{1/2}$ is the geometric mean of the numbers of attributes possessed by $\mathbf{x}$ and by $\mathbf{x}'$.

$$s(\mathbf{x}, \mathbf{x}') = \frac{\mathbf{x}^t \mathbf{x}'}{d}$$

Fraction of the $d$ attributes that are shared.

Tanimoto coefficient: the ratio of the number of shared attributes to the number of attributes possessed by $\mathbf{x}$ or $\mathbf{x}'$:

$$s(\mathbf{x}, \mathbf{x}') = \frac{\mathbf{x}^t \mathbf{x}'}{\mathbf{x}^t \mathbf{x} + \mathbf{x}'^t \mathbf{x}' - \mathbf{x}^t \mathbf{x}'}$$
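A small sketch of the Tanimoto coefficient on hypothetical binary vectors:

```python
import numpy as np

def tanimoto(x, xp):
    """Tanimoto coefficient for binary feature vectors:
    shared attributes / attributes possessed by x or x'."""
    shared = x @ xp
    return shared / (x @ x + xp @ xp - shared)

# Hypothetical binary vectors over d = 6 attributes
x = np.array([1, 0, 1, 1, 0, 0])
xp = np.array([1, 1, 1, 0, 0, 0])
print(tanimoto(x, xp))   # 2 shared / (3 + 3 - 2) = 0.5
```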

Page 16:

Issues in Choice of Similarity Function

• Tanimoto coefficient used in Information Retrieval and Taxonomy

• Fundamental issues in Measurement Theory

• Combining features is tricky: inches versus meters

• Nominal, ordinal, interval and ratio scales

Page 17:

Criterion Functions for Clustering

Sum-of-squared-errors criterion

Mean of the samples in $D_i$:

$$\mathbf{m}_i = \frac{1}{n_i} \sum_{\mathbf{x} \in D_i} \mathbf{x}$$

$$J_e = \sum_{i=1}^{c} \sum_{\mathbf{x} \in D_i} \|\mathbf{x} - \mathbf{m}_i\|^2$$

The criterion is suitable when the clusters form compact clouds; it is not the best choice when two clusters are of unequal size.
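A direct transcription of $J_e$ in Python (my own sketch; `labels` is a hypothetical array assigning each sample to a cluster):

```python
import numpy as np

def sum_squared_error(X, labels, c):
    """J_e = sum_i sum_{x in D_i} ||x - m_i||^2, with m_i the mean of D_i."""
    return sum(np.sum((X[labels == i] - X[labels == i].mean(axis=0)) ** 2)
               for i in range(c))
```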

Page 18:

Related Minimum Variance Criteria

$$J_e = \frac{1}{2} \sum_{i=1}^{c} n_i \bar{s}_i, \quad \text{where } \bar{s}_i = \frac{1}{n_i^2} \sum_{\mathbf{x} \in D_i} \sum_{\mathbf{x}' \in D_i} \|\mathbf{x} - \mathbf{x}'\|^2$$

• The squared Euclidean distance can be replaced by another similarity function $s(\mathbf{x}, \mathbf{x}')$

• The optimal partition extremizes the criterion function
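A small numerical check (my own illustration) that this pairwise form agrees with the sum-of-squared-errors form for a single cluster:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))             # one cluster D_i with n_i = 10
n = len(X)

# Sum-of-squared-errors form: sum over x of ||x - m_i||^2
je_mean = np.sum((X - X.mean(axis=0)) ** 2)

# Minimum-variance form: (1/2) n_i * s_bar_i, where s_bar_i is the
# mean squared distance over all ordered pairs in the cluster
diffs = X[:, None, :] - X[None, :, :]
s_bar = np.sum(diffs ** 2) / n ** 2
je_pairs = 0.5 * n * s_bar

print(np.isclose(je_mean, je_pairs))     # True
```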

Page 19:

Scatter Criteria

• Derived from scatter matrices
• Trace criterion
• Determinant criterion
• Invariant criteria
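The slide only names these criteria; as one illustrative example (standard definitions, not spelled out on the slide), the trace criterion minimizes $\mathrm{tr}\, S_W$, where $S_W = \sum_{i=1}^{c} \sum_{\mathbf{x} \in D_i} (\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^t$ is the within-cluster scatter matrix:

```python
import numpy as np

def trace_criterion(X, labels, c):
    """tr(S_W): trace of the within-cluster scatter matrix.
    Equals the sum-of-squared-errors criterion J_e."""
    S_W = np.zeros((X.shape[1], X.shape[1]))
    for i in range(c):
        D_i = X[labels == i]
        diff = D_i - D_i.mean(axis=0)
        S_W += diff.T @ diff             # sum of (x - m_i)(x - m_i)^t
    return np.trace(S_W)
```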

Page 20:

Hierarchical Clustering

Page 21:

Dendrogram

Page 22:

Page 23:

Agglomerative Algorithm

Page 24:

Nearest Neighbor Algorithm

Page 25:

Farthest Neighbor Algorithm

Page 26:

How to determine nearest clusters
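The figures for these slides are not reproduced here. As an illustrative sketch (not from the slides), the cluster-to-cluster distances behind the two algorithms named above: $d_{min}$ for the nearest-neighbor (single-linkage) algorithm and $d_{max}$ for the farthest-neighbor (complete-linkage) algorithm:

```python
import numpy as np

def d_min(Di, Dj):
    """Nearest-neighbor (single-linkage) cluster distance:
    min over x in D_i, x' in D_j of ||x - x'||."""
    return np.min(np.linalg.norm(Di[:, None, :] - Dj[None, :, :], axis=2))

def d_max(Di, Dj):
    """Farthest-neighbor (complete-linkage) cluster distance:
    max over x in D_i, x' in D_j of ||x - x'||."""
    return np.max(np.linalg.norm(Di[:, None, :] - Dj[None, :, :], axis=2))
```

The agglomerative algorithm then repeatedly merges the pair of clusters with the smallest such distance until the desired number of clusters remains.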