Sigmedia, Electronic Engineering Dept., Trinity College, Dublin. 1
An Introduction to PDF Estimation and Clustering
David Corrigan
Electrical and Electronic Engineering Dept.,
University of Dublin, Trinity College.
See www.sigmedia.tv for more information.
Sigmedia, Electronic Engineering Dept., Trinity College, Dublin. 2
PDF Estimation
• Quantify the characteristics of a signal, x[n], by measuring its PDF, p(Xn = x).
• Ubiquitous in signal processing applications: image segmentation, restoration, texture
synthesis.
[Figure: an estimated PDF of image intensity, Probability vs. Intensity.]
PDF Estimation
• Estimators fall into two categories:
1. Parametric Estimation
– A known model for the PDF is fitted to the data (e.g. a Gaussian distribution for a
noise signal).
– The PDF is then represented by the parameters (mean, variance, etc.).
2. Non-Parametric Estimation
– No assumed model for the PDF.
– The PDF is estimated directly from the measured signal.
• A correct parametric model gives a better estimate from less data than non-parametric
techniques.
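As a sketch of the parametric case, the example below fits a Gaussian model to a synthetic noise signal by estimating its mean and variance. The signal and all names here are illustrative, not from the lecture:

```python
import numpy as np

# Illustrative parametric estimation: a Gaussian model for a noise signal.
# The whole PDF is summarised by just two parameters, the mean and variance.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)  # synthetic noise signal

mu = x.mean()    # maximum-likelihood estimate of the mean
var = x.var()    # maximum-likelihood estimate of the variance

def gaussian_pdf(t, mu, var):
    """Evaluate the fitted Gaussian PDF at t."""
    return np.exp(-(t - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
```

With 10,000 samples the two estimated parameters land very close to the true values (5 and 4), which is the sense in which a correct parametric model needs comparatively little data.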
Non-Parametric Estimators
• The best-known estimator is the histogram.
• It finds the frequency (and hence probability) of a signal value lying in a range.
[Figures: histogram estimates of an intensity PDF, Probability vs. Intensity, with bin widths of 1 and 5.]
• Histograms are poor estimates if the bins are not adequately populated.
• We can increase the bin width or smooth the histogram to compensate.
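A minimal sketch of the histogram estimator (the function name and synthetic intensities are illustrative):

```python
import numpy as np

def histogram_pdf(x, bin_width, lo, hi):
    """Estimate the PDF of x on [lo, hi) with the given bin width.

    Returns the bin centres and the per-bin probability
    (frequency normalised by the number of samples)."""
    edges = np.arange(lo, hi + bin_width, bin_width)
    counts, _ = np.histogram(x, bins=edges)
    probs = counts / len(x)                 # relative frequency per bin
    centres = edges[:-1] + bin_width / 2
    return centres, probs

rng = np.random.default_rng(1)
intensity = rng.integers(0, 256, size=5_000)   # synthetic 8-bit intensities
centres, probs = histogram_pdf(intensity, bin_width=5, lo=0, hi=256)
```

A wider bin gathers more samples per bin, which is the trade-off noted above: better-populated bins at the cost of coarser resolution.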
Kernel Density Estimation
• Another non-parametric estimator.
• Given a signal x[n], the PDF estimate is

p(X = x) = (1/(Nh)) · Σ_{i=1}^{N} K((x − x[i]) / h)    (1)

• K(x) is the kernel and h is the bandwidth.
• Common kernels include the Gaussian kernel and the Epanechnikov kernel:

K(x) = k(1 − x²)   if x² < 1
       0           otherwise    (2)
[Figure: the Epanechnikov kernel, K(x) vs. x, rising to k at x = 0 and zero for |x| ≥ 1.]
Kernel Density Estimation
• A kernel density estimate is “visually comparable” to a smoothed histogram (but
quite different in concept).
• The bandwidth controls the smoothness of the KDE.
• The PDF can be estimated at any signal value (real or complex).
• We don’t need to worry about quantising the signal or choosing bin widths.
• It is slow: O(N) to estimate the PDF at each signal value.
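The estimator in equation (1) can be sketched as follows, using the Epanechnikov kernel of equation (2). Here k = 3/4 is assumed so that the kernel integrates to 1 (the lecture leaves k unspecified):

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel with k = 3/4 (unit area)."""
    return np.where(u ** 2 < 1, 0.75 * (1 - u ** 2), 0.0)

def kde(x, samples, h):
    """Kernel density estimate of p(X = x); O(N) per evaluation point."""
    u = (x - samples) / h
    return epanechnikov(u).sum() / (len(samples) * h)

rng = np.random.default_rng(2)
samples = rng.normal(0.0, 1.0, size=2_000)
p0 = kde(0.0, samples, h=0.5)   # density estimate at x = 0
```

Note that `kde` can be evaluated at any real x, not just at bin centres, which is the advantage over the histogram noted above.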
Gaussian Mixture Models (GMMs)
• The PDF is modelled as a weighted sum of Gaussian distributions (which can be multivariate for
vector-valued signals):

p(X = x) = Σ_{k=1}^{K} π(k) · N(x; μ(k), R(k))    (3)

• The model has K components.
• π(k) is the weight of each Gaussian, such that Σ_k π(k) = 1.
• μ(k) and R(k) are the mean and variance (or covariance) of the kth component.
• To create the GMM, certain questions need to be answered:
1. How many clusters do we choose?
2. What are the initial estimates for the weights, means and variances?
3. How do we make sure that we have the optimum model for our data?
• To answer these questions we need to talk a bit about clustering.
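As an illustration, the mixture density of equation (3) can be evaluated for a scalar signal as below. The parameter values are made up for the example, not fitted to any data:

```python
import numpy as np

def gmm_pdf(x, pi, mu, R):
    """p(X = x) as a weighted sum of K scalar Gaussian components."""
    x = np.asarray(x, dtype=float)
    # One Gaussian density per component, broadcast over evaluation points.
    comps = np.exp(-(x[..., None] - mu) ** 2 / (2 * R)) / np.sqrt(2 * np.pi * R)
    return comps @ pi   # sum_k pi(k) N(x; mu(k), R(k))

pi = np.array([0.3, 0.7])    # weights, summing to 1
mu = np.array([-2.0, 3.0])   # component means
R = np.array([1.0, 4.0])     # component variances
```

Because the weights sum to 1 and each component is a valid density, the mixture itself integrates to 1.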
Clustering
• Clustering involves partitioning the set of signal values into subsets of similar values.
• Used in signal modelling, segmentation, vector quantisation, image compression, and more.
• Consider the following 2D vector-valued signal
[Figures: a 2D vector-valued signal d(x) = (dx(x), dy(x)), and a scatter plot of d(x) in the (dx, dy) plane.]
k-Means Clustering
• An algorithm that divides data into an arbitrary number of clusters, K.
• The algorithm attempts to minimise the total distance, V, between the data values, x, and
their cluster centroids:

V = Σ_{k=1}^{K} Σ_{j ∈ C_k} ‖x_j − c_k‖²,  where C_k is the kth cluster    (4)
K-means operates as follows:
1. The user selects the number of clusters and assigns a value to each cluster centroid.
2. Every data point is assigned to the cluster of the nearest centroid. This partitions the
data into clusters.
3. The centroids of the new clusters are estimated (by estimating the mean data value of
each cluster).
4. Steps 2 and 3 are iterated until centroid values are suitably converged.
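The four steps above can be sketched as follows (illustrative code, not Matlab's “kmeans” function):

```python
import numpy as np

def kmeans(x, centroids, n_iters=100):
    """x: (N, D) data; centroids: (K, D) initial centres chosen by the user (step 1)."""
    centroids = centroids.astype(float).copy()
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iters):
        # Step 2: assign each point to the cluster of the nearest centroid.
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: re-estimate each centroid as the mean of its cluster
        # (keeping the old centroid if a cluster ends up empty).
        new = np.array([x[labels == k].mean(axis=0) if np.any(labels == k)
                        else centroids[k] for k in range(len(centroids))])
        # Step 4: iterate until the centroids have converged.
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```

The result depends on the initial centroids, which is why the user-supplied starting values in step 1 matter.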
k-Means Clustering
• Nice demo -
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
• Matlab has its own “kmeans” function.
[Figures: scatter plot of d(x), and the same scatter plot partitioned by k-means.]
• Related “Fuzzy C-Means” algorithm allows data points to belong to one or more clusters.
Back to GMMs
• How many components to pick? Usually an arbitrary number but could use mean shift
or watershed algorithms (later).
• How to estimate the initial GMM parameter values? By using a clustering algorithm like
k-means to get a rough clustering of the data set.
– The mean and covariance of each component is estimated on the corresponding
cluster. The component weight is the fraction of the overall number of data points in
the cluster.
• How to get the model parameters to best fit the data? By using the Expectation
Maximisation (EM) algorithm.
Expectation Maximisation
EM finds the maximum likelihood estimates for the model parameters. The algorithm has
two steps.
1. The E-Step: The current model parameters are used to cluster the data. Each point x is
assigned to its maximum-likelihood component:

k̂(x) = arg max_k π(k) · N(x; μ(k), R(k))    (5)
2. The M-Step: From the data set and the clustered data, the new parameter values are
estimated.
• Given the data, find the model parameters that best fit the clustering obtained from
the E-Step.
• For a GMM this simplifies to estimating the mean and covariance of the data
points in each cluster. Again, the weights are the fractions of points in each cluster.
The parameters are optimised by alternating between the two steps until the parameters
converge.
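The alternation above can be sketched for a scalar GMM as below. Note that this follows the hard-assignment clustering of equation (5) as described in the E-Step; the standard EM algorithm instead uses soft responsibilities:

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm(x, pi, mu, R, n_iters=50):
    """Hard-assignment E/M loop for a scalar GMM (illustrative sketch)."""
    for _ in range(n_iters):
        # E-step: assign each point to its maximum-likelihood component.
        lik = pi * normal_pdf(x[:, None], mu, R)   # (N, K)
        k_hat = lik.argmax(axis=1)
        # M-step: re-fit each component to its cluster.
        for k in range(len(pi)):
            members = x[k_hat == k]
            if len(members) == 0:
                continue                           # skip empty clusters
            mu[k] = members.mean()
            R[k] = members.var() + 1e-6            # guard against zero variance
            pi[k] = len(members) / len(x)          # fraction of points in cluster
    return pi, mu, R
```

Initialising `pi`, `mu`, and `R` from a k-means clustering, as suggested earlier, makes this loop far less likely to converge to a poor solution.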
Expectation Maximisation
• The algorithm is broadly similar to k-means clustering.
– Both algorithms have a clustering stage followed by a parameter estimation stage.
– In k-means the Euclidean distance from the centroid is used. In EM, the Euclidean
distance from the centroid (i.e. the mean) is normalised by the component covariance
and weight.
• Nice demo applet here http://lcn.epfl.ch/tutorial/english/gaussian/html/.
Mean Shift
• Mean Shift clusters the data by finding the peaks of the Kernel Density Estimate.
• The number of clusters is automatically determined.
Remember the equation for the KDE:

f(x) = (1/(Nh)) · Σ_{i=1}^{N} K((x − x[i]) / h)    (6)
At a peak the gradient of the density is 0. The gradient of the estimate is

∇f(x) = (1/(Nh)) · Σ_{i=1}^{N} ∇K((x − x[i]) / h)    (7)
Mean Shift
If we use the Epanechnikov kernel for K(x), its gradient is linear inside the bandwidth.
Therefore

∇f(x) ∝ (1/N) · Σ_{x[i] ∈ S_h(x)} (x[i] − x)    (8)

where S_h(x) is a sphere of radius h centred on x.
The RHS of equation (8) is the difference between the current point x and the mean of all the
data points in the sphere centred on x. It is known as the mean shift vector.
Mean Shift
There are some important things to consider:
• The direction of the mean shift vector is the direction of the gradient.
• The gradient vector points in the direction of maximum change.
• The gradient vector at a peak is 0. Therefore the mean shift vector is also 0.
The peak can be found by following the mean shift vector to regions of
higher density until the mean shift vector is 0.
Mean Shift
This can be implemented as follows:
1. Pick a data point at random.
2. Find the mean of all points in the sphere centred on the data point.
3. Repeat by searching the sphere centred on the mean from step 2.
4. Stop when successive means are the same. The mean is the value of the peak.

[Figure: clustering using mean shift. The chosen bandwidth is 2.5.]
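The peak-seeking steps above can be sketched for scalar data. With the Epanechnikov kernel, the update is simply the mean of the samples inside the window of radius h, per equation (8) (the sample data here is synthetic):

```python
import numpy as np

def find_peak(start, samples, h, tol=1e-6, max_iters=500):
    """Follow the mean shift vector from `start` until it (nearly) vanishes."""
    x = float(start)
    for _ in range(max_iters):
        window = samples[np.abs(samples - x) <= h]   # points inside S_h(x)
        if len(window) == 0:
            break
        new_x = window.mean()                        # mean shift update
        if abs(new_x - x) < tol:                     # successive means the same
            return new_x
        x = new_x
    return x

rng = np.random.default_rng(4)
samples = np.concatenate([rng.normal(-5.0, 1.0, 1000),
                          rng.normal(5.0, 1.0, 1000)])
peak = find_peak(-4.0, samples, h=2.0)   # climbs towards the mode near -5
```

Starting points on either side of the gap converge to different peaks, which is exactly how the clustering on the next slide assigns points to clusters.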
Mean Shift
• To cluster the data this procedure is applied to every point in the data.
• Every data point will have a characteristic peak.
• All data points with the same peak are assigned to a cluster.
[Figure: clustering of d(x) using mean shift. The chosen bandwidth is 2.5.]
Mean Shift
It’s good:
• It tells us something about the complexity of the signal.
• No need to guess the number of clusters.
• Bandwidth parameter gives some degree of control.
It’s bad:
• The algorithm is very slow, O(N²): the distance between every pair of points in the
data set must be known.
• Several publications have attempted to address this, including Akash’s.
• There is a tendency to get small clusters in regions of low density; post-processing may be necessary.
Reading
For more information on KDEs and mean shift
• Comaniciu and Meer. Mean Shift Analysis and Applications,
http://www.caip.rutgers.edu/~comanici/Papers/MsAnalysis.pdf.
• Akash’s paper from ICIP 2008 on the Path-Assigned Mean-Shift algorithm.
For k-means clustering
• Some lecture notes
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html
For training GMMs using EM
• The Wikipedia entry for Expectation Maximisation gives more detail on EM and shows how it
applies to GMMs.