Sigmedia, Electronic Engineering Dept., Trinity College, Dublin. 1
An Introduction to PDF Estimation and Clustering
David Corrigan
Electrical and Electronic Engineering Dept.,
University of Dublin, Trinity College.
See www.sigmedia.tv for more information.
Sigmedia, Electronic Engineering Dept., Trinity College, Dublin. 2
PDF Estimation
• Quantify the characteristics of a signal, x[n], by measuring its PDF, p(Xn = x).
• Ubiquitous in signal processing applications: image segmentation, restoration, texture
synthesis.
[Figure: an estimated PDF of image intensity, Probability vs. Intensity.]
PDF Estimation
• Estimators fall into two categories:
1. Parametric Estimation
– A known model for the PDF is fitted to the data (e.g. a Gaussian distribution for a
noise signal).
– The PDF is then represented by the parameters (mean, variance, etc.).
2. Non-Parametric Estimation
– No assumed model for the PDF.
– The PDF is estimated directly from the measured signal.
• A correct parametric model gives a better estimate from less data than non-parametric
techniques.
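As a sketch of the parametric case, the example below fits a Gaussian model to a synthetic noise signal by estimating its mean and variance. The signal and all names here are illustrative, not from the lecture:

```python
import numpy as np

# Illustrative parametric estimation: a Gaussian model for a noise signal.
# The whole PDF is summarised by just two parameters, the mean and variance.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)  # synthetic noise signal

mu = x.mean()    # maximum-likelihood estimate of the mean
var = x.var()    # maximum-likelihood estimate of the variance

def gaussian_pdf(t, mu, var):
    """Evaluate the fitted Gaussian PDF at t."""
    return np.exp(-(t - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
```

With 10,000 samples the two estimated parameters land very close to the true values (5 and 4), which is the sense in which a correct parametric model needs comparatively little data.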
Non-Parametric Estimators
• The best-known estimator is the histogram.
• It finds the frequency (and hence probability) of a signal value lying in a range.
[Figures: histogram estimates of an intensity PDF, Probability vs. Intensity, with bin widths of 1 and 5.]
• Histograms are poor estimates if the bins are not adequately populated.
• We can increase the bin width or smooth the histogram to compensate.
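A minimal sketch of the histogram estimator (the function name and synthetic intensities are illustrative):

```python
import numpy as np

def histogram_pdf(x, bin_width, lo, hi):
    """Estimate the PDF of x on [lo, hi) with the given bin width.

    Returns the bin centres and the per-bin probability
    (frequency normalised by the number of samples)."""
    edges = np.arange(lo, hi + bin_width, bin_width)
    counts, _ = np.histogram(x, bins=edges)
    probs = counts / len(x)                 # relative frequency per bin
    centres = edges[:-1] + bin_width / 2
    return centres, probs

rng = np.random.default_rng(1)
intensity = rng.integers(0, 256, size=5_000)   # synthetic 8-bit intensities
centres, probs = histogram_pdf(intensity, bin_width=5, lo=0, hi=256)
```

A wider bin gathers more samples per bin, which is the trade-off noted above: better-populated bins at the cost of coarser resolution.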
Kernel Density Estimation
• Another non-parametric estimator.
• Given a signal x[n], the PDF estimate is

p(X = x) = (1/(Nh)) · Σ_{i=1}^{N} K((x − x[i]) / h)    (1)

• K(x) is the kernel and h is the bandwidth.
• Common kernels include the Gaussian kernel and the Epanechnikov kernel:

K(x) = k(1 − x²)   if x² < 1
       0           otherwise    (2)
[Figure: the Epanechnikov kernel, K(x) vs. x, rising to k at x = 0 and zero for |x| ≥ 1.]
Kernel Density Estimation
• A kernel density estimate is “visually comparable” to a smoothed histogram (but
quite different in concept).
• The bandwidth controls the smoothness of the KDE.
• The PDF can be estimated at any signal value (real or complex).
• We don’t need to worry about quantising the signal or choosing bin widths.
• It is slow: O(N) to estimate the PDF at each signal value.
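The estimator in equation (1) can be sketched as follows, using the Epanechnikov kernel of equation (2). Here k = 3/4 is assumed so that the kernel integrates to 1 (the lecture leaves k unspecified):

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel with k = 3/4 (unit area)."""
    return np.where(u ** 2 < 1, 0.75 * (1 - u ** 2), 0.0)

def kde(x, samples, h):
    """Kernel density estimate of p(X = x); O(N) per evaluation point."""
    u = (x - samples) / h
    return epanechnikov(u).sum() / (len(samples) * h)

rng = np.random.default_rng(2)
samples = rng.normal(0.0, 1.0, size=2_000)
p0 = kde(0.0, samples, h=0.5)   # density estimate at x = 0
```

Note that `kde` can be evaluated at any real x, not just at bin centres, which is the advantage over the histogram noted above.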
Gaussian Mixture Models (GMMs)
• The PDF is modelled as a weighted sum of Gaussian distributions (which can be multivariate for
vector-valued signals):

p(X = x) = Σ_{k=1}^{K} π(k) · N(x; μ(k), R(k))    (3)

• The model has K components.
• π(k) is the weight of each Gaussian, such that Σ_k π(k) = 1.
• μ(k) and R(k) are the mean and variance (or covariance) of the kth component.
• To create the GMM, certain questions need to be answered:
1. How many clusters do we choose?
2. What are the initial estimates for the weights, means and variances?
3. How do we make sure that we have the optimum model for our data?
• To answer these questions we need to talk a bit about clustering.
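As an illustration, the mixture density of equation (3) can be evaluated for a scalar signal as below. The parameter values are made up for the example, not fitted to any data:

```python
import numpy as np

def gmm_pdf(x, pi, mu, R):
    """p(X = x) as a weighted sum of K scalar Gaussian components."""
    x = np.asarray(x, dtype=float)
    # One Gaussian density per component, broadcast over evaluation points.
    comps = np.exp(-(x[..., None] - mu) ** 2 / (2 * R)) / np.sqrt(2 * np.pi * R)
    return comps @ pi   # sum_k pi(k) N(x; mu(k), R(k))

pi = np.array([0.3, 0.7])    # weights, summing to 1
mu = np.array([-2.0, 3.0])   # component means
R = np.array([1.0, 4.0])     # component variances
```

Because the weights sum to 1 and each component is a valid density, the mixture itself integrates to 1.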
Clustering
• Clustering involves partitioning the set of signal values into subsets of similar values.
• Used in signal modelling, segmentation, vector quantisation, image compression, and more.
• Consider the following 2D vector-valued signal
[Figures: a 2D vector-valued signal d(x) = (dx(x), dy(x)), and a scatter plot of d(x) in the (dx, dy) plane.]
k-Means Clustering
• An algorithm that divides data into an arbitrary number of clusters, K.
• The algorithm attempts to minimise the total distance, V, between the data values, x, and
their cluster centroids:

V = Σ_{k=1}^{K} Σ_{j ∈ C_k} ‖x_j − c_k‖²,  where C_k is the kth cluster    (4)
K-means operates as follows:
1. The user selects the number of clusters and assigns a value to each cluster centroid.
2. Every data point is assigned to the cluster of the nearest centroid. This partitions the
data into clusters.
3. The centroids of the new clusters are estimated (by estimating the mean data value of
each cluster).
4. Steps 2 and 3 are iterated until centroid values are suitably converged.
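The four steps above can be sketched as follows (illustrative code, not Matlab's “kmeans” function):

```python
import numpy as np

def kmeans(x, centroids, n_iters=100):
    """x: (N, D) data; centroids: (K, D) initial centres chosen by the user (step 1)."""
    centroids = centroids.astype(float).copy()
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iters):
        # Step 2: assign each point to the cluster of the nearest centroid.
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: re-estimate each centroid as the mean of its cluster
        # (keeping the old centroid if a cluster ends up empty).
        new = np.array([x[labels == k].mean(axis=0) if np.any(labels == k)
                        else centroids[k] for k in range(len(centroids))])
        # Step 4: iterate until the centroids have converged.
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```

The result depends on the initial centroids, which is why the user-supplied starting values in step 1 matter.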
k-Means Clustering
• Nice demo -
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
• Matlab has its own “kmeans” function.
[Figures: scatter plot of d(x), and the same scatter plot partitioned by k-means.]
• Related “Fuzzy C-Means” algorithm allows data points to belong to one or more clusters.
Back to GMMs
• How many components to pick? Usually an arbitrary number but could use mean shift
or watershed algorithms (later).
• How to estimate the initial GMM parameter values? By using a clustering algorithm like
k-means to get a rough clustering of the data set.
– The mean and covariance of each component is estimated on the corresponding
cluster. The component weight is the fraction of the overall number of data points in
the cluster.
• How to get the model parameters to best fit the data? By using the Expectation
Maximisation (EM) algorithm.
Expectation Maximisation
EM finds the maximum likelihood estimates for the model parameters. The algorithm has
two steps.
1. The E-Step: The current model parameters are used to cluster the data. Each point x is
assigned to its maximum-likelihood component:

k̂(x) = arg max_k π(k) · N(x; μ(k), R(k))    (5)
2. The M-Step: From the data set and the clustered data, the new parameter values are
estimated.
• Given the data, find the model parameters that best fit the clustering obtained from
the E-Step.
• For a GMM this simplifies to estimating the mean and covariance of the data
points in each cluster. Again, the weights are the fractions of points in each cluster.
The parameters are optimised by alternating between the two steps until the parameters
converge.
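The alternation above can be sketched for a scalar GMM as below. Note that this follows the hard-assignment clustering of equation (5) as described in the E-Step; the standard EM algorithm instead uses soft responsibilities:

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm(x, pi, mu, R, n_iters=50):
    """Hard-assignment E/M loop for a scalar GMM (illustrative sketch)."""
    for _ in range(n_iters):
        # E-step: assign each point to its maximum-likelihood component.
        lik = pi * normal_pdf(x[:, None], mu, R)   # (N, K)
        k_hat = lik.argmax(axis=1)
        # M-step: re-fit each component to its cluster.
        for k in range(len(pi)):
            members = x[k_hat == k]
            if len(members) == 0:
                continue                           # skip empty clusters
            mu[k] = members.mean()
            R[k] = members.var() + 1e-6            # guard against zero variance
            pi[k] = len(members) / len(x)          # fraction of points in cluster
    return pi, mu, R
```

Initialising `pi`, `mu`, and `R` from a k-means clustering, as suggested earlier, makes this loop far less likely to converge to a poor solution.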
Expectation Maximisation
• The algorithm is broadly similar to k-means clustering.
– Both algorithms have a clustering stage followed by a parameter estimation stage.
– In k-means the Euclidean distance from the centroid is used. In EM, the Euclidean
distance from the centroid (i.e. the mean) is normalised by the component covariance
and weight.
• Nice demo applet here http://lcn.epfl.ch/tutorial/english/gaussian/html/.
Mean Shift
• Mean Shift clusters the data by finding the peaks of the Kernel Density Estimate.
• The number of clusters is automatically determined.
Remember the equation for the KDE:

f(x) = (1/(Nh)) · Σ_{i=1}^{N} K((x − x[i]) / h)    (6)
At a peak the gradient of the density is 0. The gradient of the estimate is

∇f(x) = (1/(Nh)) · Σ_{i=1}^{N} ∇K((x − x[i]) / h)    (7)
Mean Shift
If we use the Epanechnikov kernel for K(x), its gradient is linear inside the bandwidth.
Therefore

∇f(x) ∝ (1/N) · Σ_{x[i] ∈ S_h(x)} (x[i] − x)    (8)

where S_h(x) is a sphere of radius h centred on x.
The RHS of equation (8) is the difference between the current point x and the mean of all the
data points in the sphere centred on x. It is known as the mean shift vector.
Mean Shift
There are some important things to consider:
• The direction of the mean shift vector is the direction of the gradient.
• The gradient vector points in the direction of maximum change.
• The gradient vector at a peak is 0. Therefore the mean shift vector is also 0.
The peak can be found by following the mean shift vector to regions of
higher density until the mean shift vector is 0.
Mean Shift
This can be implemented as follows:
1. Pick a data point at random.
2. Find the mean of all points in the sphere centred on the data point.
3. Repeat by searching the sphere centred on the mean from step 2.
4. Stop when successive means are the same. The mean is the value of the peak.

[Figure: clustering using mean shift. The chosen bandwidth is 2.5.]
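The peak-seeking steps above can be sketched for scalar data. With the Epanechnikov kernel, the update is simply the mean of the samples inside the window of radius h, per equation (8) (the sample data here is synthetic):

```python
import numpy as np

def find_peak(start, samples, h, tol=1e-6, max_iters=500):
    """Follow the mean shift vector from `start` until it (nearly) vanishes."""
    x = float(start)
    for _ in range(max_iters):
        window = samples[np.abs(samples - x) <= h]   # points inside S_h(x)
        if len(window) == 0:
            break
        new_x = window.mean()                        # mean shift update
        if abs(new_x - x) < tol:                     # successive means the same
            return new_x
        x = new_x
    return x

rng = np.random.default_rng(4)
samples = np.concatenate([rng.normal(-5.0, 1.0, 1000),
                          rng.normal(5.0, 1.0, 1000)])
peak = find_peak(-4.0, samples, h=2.0)   # climbs towards the mode near -5
```

Starting points on either side of the gap converge to different peaks, which is exactly how the clustering on the next slide assigns points to clusters.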
Mean Shift
• To cluster the data this procedure is applied to every point in the data.
• Every data point will have a characteristic peak.
• All data points with the same peak are assigned to a cluster.
[Figure: clustering of d(x) using mean shift. The chosen bandwidth is 2.5.]
Mean Shift
It’s good:
• It tells us something about the complexity of the signal.
• No need to guess the number of clusters.
• Bandwidth parameter gives some degree of control.
It’s bad:
• The algorithm is very slow, O(N²): the distance between every pair of points in the
data set must be known.
• Several publications have attempted to address this, including Akash’s.
• There is a tendency to get small clusters in regions of low density; post-processing may be necessary.
Reading
For more information on KDEs and mean shift
• Comaniciu and Meer. Mean Shift Analysis and Applications,
http://www.caip.rutgers.edu/~comanici/Papers/MsAnalysis.pdf.
• Akash’s paper from ICIP 2008 on the Path-Assigned Mean-Shift algorithm.
For k-means clustering
• Some lecture notes
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html
For training GMMs using EM
• The Wikipedia entry for Expectation Maximisation gives more detail on EM and shows how it
applies to GMMs.