k-mle: a fast algorithm for learning statistical mixture models


http://arxiv.org/abs/1203.5181 (preliminary version presented at IEEE ICASSP 2012)


k-MLE: A fast algorithm for learning statistical mixture models

(arXiv:1203.5181)

Frank NIELSEN

Sony Computer Science Laboratories, Inc.

28th March 2012, International Conference on Acoustics, Speech, and Signal Processing

ICASSP, Kyoto ICC


Outline

- Background
  - Statistical mixtures of exponential families (EFMMs)
  - Legendre transform and mixture dual parameterizations
- Contributions
  - k-MLE and its variants
  - k-MLE initialization (k-MLE++)
- Summary


Exponential Family Mixture Models (EFMMs)
Generalize Gaussian & Rayleigh MMs to many common distributions.

m(x) = Σ_{i=1}^k w_i p_F(x; λ_i), with ∀i, w_i > 0 and Σ_{i=1}^k w_i = 1

p_F(x; λ) = e^{⟨t(x), θ⟩ − F(θ) + k(x)}

F : log-Laplace transform (partition, cumulant function):

∫_{x∈X} p_F(x; θ) dx = 1  ⇒  F(θ) = log ∫_{x∈X} e^{⟨t(x), θ⟩ + k(x)} dx,

θ ∈ Θ = { θ | ∫_{x∈X} e^{⟨t(x), θ⟩ + k(x)} dx < ∞ },

the natural parameter space.

- d: dimension of the support X.
- D: order of the family (= dim Θ). Sufficient statistic: t(x): R^d → R^D.


Statistical mixtures: Rayleigh MMs [7, 5]
IntraVascular UltraSound (IVUS) imaging:

Rayleigh distribution:

p(x; λ) = (x/λ²) e^{−x²/(2λ²)}, x ∈ R⁺ = X

d = 1 (univariate), D = 1 (order 1)
θ = −1/(2λ²), Θ = (−∞, 0)
F(θ) = −log(−2θ)
t(x) = x²
k(x) = log x
(Weibull distribution with shape k = 2)

Coronary plaques: fibrotic tissues, calcified tissues, lipidic tissues.
Rayleigh Mixture Models (RMMs): for segmentation and classification tasks.
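To make the canonical decomposition concrete, here is a small numerical check (my own sketch, not part of the slides) that the exponential-family form p_F(x; θ) = exp(⟨t(x), θ⟩ − F(θ) + k(x)), with the quantities listed above, recovers the usual Rayleigh density:

```python
import numpy as np

def rayleigh_pdf(x, lam):
    # Classical parameterization: p(x; lambda) = (x / lambda^2) exp(-x^2 / (2 lambda^2))
    return (x / lam**2) * np.exp(-x**2 / (2.0 * lam**2))

def rayleigh_pdf_ef(x, lam):
    # Canonical exponential-family decomposition listed on the slide
    theta = -1.0 / (2.0 * lam**2)     # natural parameter, Theta = (-inf, 0)
    F = -np.log(-2.0 * theta)         # cumulant (log-normalizer)
    t = x**2                          # sufficient statistic
    k = np.log(x)                     # carrier term
    return np.exp(t * theta - F + k)

x = np.linspace(0.1, 5.0, 50)
print(np.allclose(rayleigh_pdf(x, 1.3), rayleigh_pdf_ef(x, 1.3)))   # True
```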


Statistical mixtures: Gaussian MMs [3, 5]
Gaussian mixture models (GMMs). Color image interpreted as a 5D xyRGB point set.

Gaussian distribution p(x; µ, Σ):

p(x; µ, Σ) = 1/((2π)^{d/2} √|Σ|) e^{−(1/2) D_{Σ⁻¹}(x−µ, x−µ)}

Squared Mahalanobis distance: D_Q(x, y) = (x − y)^T Q (x − y)

x ∈ R^d = X (multivariate)
D = d(d+3)/2 (order)
θ = (Σ⁻¹µ, (1/2)Σ⁻¹) = (θ_v, θ_M)
Θ = R^d × S^d_{++}
F(θ) = (1/4) θ_v^T θ_M⁻¹ θ_v − (1/2) log|θ_M| + (d/2) log π
t(x) = (x, −xx^T)
k(x) = 0
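As a sanity check (my own sketch, not from the slides), the code below evaluates exp(⟨t(x), θ⟩ − F(θ)) with the natural parameters and the cumulant F given above and compares it to the usual Gaussian density; scipy.stats.multivariate_normal is used only for the reference value.

```python
import numpy as np
from scipy.stats import multivariate_normal

d = 2
mu = np.array([1.0, -0.5])
Sigma = np.array([[1.5, 0.4], [0.4, 0.8]])

theta_v = np.linalg.solve(Sigma, mu)       # theta_v = Sigma^{-1} mu
theta_M = 0.5 * np.linalg.inv(Sigma)       # theta_M = (1/2) Sigma^{-1}

def F(theta_v, theta_M):
    # F(theta) = 1/4 theta_v^T theta_M^{-1} theta_v - 1/2 log|theta_M| + d/2 log(pi)
    return (0.25 * theta_v @ np.linalg.solve(theta_M, theta_v)
            - 0.5 * np.log(np.linalg.det(theta_M))
            + 0.5 * d * np.log(np.pi))

def pF(x):
    # <t(x), theta> with t(x) = (x, -x x^T); the matrix part uses the trace inner product
    inner = x @ theta_v - np.trace(np.outer(x, x) @ theta_M)
    return np.exp(inner - F(theta_v, theta_M))          # k(x) = 0 for the Gaussian

x = np.array([0.3, 0.2])
print(pF(x), multivariate_normal.pdf(x, mean=mu, cov=Sigma))   # the two values match
```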


Sampling from a Gaussian Mixture Model (GMM)
To sample a variate x from a GMM:

- Choose a component l according to the weight distribution w_1, ..., w_k,
- Draw a variate x according to N(µ_l, Σ_l).

Doubly stochastic process:

1. Throw a (biased) die with k faces to choose the component: l ∼ Multinomial(w_1, ..., w_k). (The multinomial distribution also belongs to the exponential families.)
2. Then draw a variate x at random from the l-th component: x ∼ Normal(µ_l, Σ_l), i.e., x = µ_l + Cz with the Cholesky factorization Σ_l = CC^T and z = [z_1 ... z_d]^T standard normal random variates, z_i = √(−2 log U_1) cos(2π U_2).
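A minimal sketch of this doubly stochastic sampler, with hypothetical inputs weights, mus, Sigmas (a k-component GMM in dimension d); the standard normal variates are produced with the Box-Muller formula quoted above.

```python
import numpy as np

def sample_gmm(weights, mus, Sigmas, n, rng=np.random.default_rng(0)):
    """Draw n variates: (1) pick a component l, (2) sample from N(mu_l, Sigma_l)."""
    k, d = len(weights), mus.shape[1]
    chols = [np.linalg.cholesky(S) for S in Sigmas]   # Sigma_l = C C^T
    samples = np.empty((n, d))
    for i in range(n):
        l = rng.choice(k, p=weights)                  # 1. biased k-faced die (multinomial draw)
        u1 = 1.0 - rng.random(d)                      # in (0, 1], avoids log(0)
        u2 = rng.random(d)
        z = np.sqrt(-2.0 * np.log(u1)) * np.cos(2.0 * np.pi * u2)   # Box-Muller z_i
        samples[i] = mus[l] + chols[l] @ z            # 2. x = mu_l + C z
    return samples

# Example: a 2-component GMM in 2D
weights = np.array([0.3, 0.7])
mus = np.array([[0.0, 0.0], [4.0, 4.0]])
Sigmas = np.array([[[1.0, 0.0], [0.0, 1.0]],
                   [[2.0, 0.5], [0.5, 1.0]]])
x = sample_gmm(weights, mus, Sigmas, 1000)
```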


Statistical mixtures: Generative models of data sets

GMM = feature descriptor for information retrieval (IR) → classification, matching, etc. Increase dimension using color image patches. Low-frequency information encoded into a compact statistical model.

Generative model → statistical image by GMM sampling.

(Figure panels: Source | GMM | Sample.)


Distance between exponential families: Relative entropy

- Distance between features (e.g., GMMs)
- Kullback-Leibler divergence (cross-entropy minus entropy):

KL(P : Q) = ∫ p(x) log (p(x)/q(x)) dx ≥ 0
          = ∫ p(x) log (1/q(x)) dx − ∫ p(x) log (1/p(x)) dx
            (cross-entropy H×(P : Q))   (entropy H(p) = H×(P : P))
          = F(θ_Q) − F(θ_P) − ⟨θ_Q − θ_P, ∇F(θ_P)⟩
          = B_F(θ_Q : θ_P)

Bregman divergence B_F defined for a strictly convex and differentiable function (up to some affine terms).

- Proof that KL(P : Q) = B_F(θ_Q : θ_P) follows from X ∼ EF(θ) ⟹ E[t(X)] = ∇F(θ).
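A quick numerical illustration (my own sketch) of KL(P : Q) = B_F(θ_Q : θ_P), using the Rayleigh family introduced earlier (F(θ) = −log(−2θ), θ = −1/(2λ²)); the KL integral is approximated on a grid.

```python
import numpy as np

F = lambda theta: -np.log(-2.0 * theta)       # Rayleigh cumulant function
gradF = lambda theta: -1.0 / theta

def bregman(F, gradF, p, q):
    # B_F(p : q) = F(p) - F(q) - <p - q, gradF(q)>
    return F(p) - F(q) - (p - q) * gradF(q)

def rayleigh_pdf(x, lam):
    return (x / lam**2) * np.exp(-x**2 / (2.0 * lam**2))

lam_p, lam_q = 1.0, 2.0
theta_p, theta_q = -1.0 / (2.0 * lam_p**2), -1.0 / (2.0 * lam_q**2)

# KL(P : Q) = int p(x) log(p(x)/q(x)) dx, approximated by a Riemann sum
x = np.linspace(1e-6, 30.0, 100_000)
p, q = rayleigh_pdf(x, lam_p), rayleigh_pdf(x, lam_q)
kl_numeric = np.sum(p * np.log(p / q)) * (x[1] - x[0])

print(kl_numeric, bregman(F, gradF, theta_q, theta_p))   # both ~ 0.636
```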


Bregman divergence: Geometric interpretation

Potential function F, graph plot F: (x, F(x)).

D_F(p : q) = F(p) − F(q) − ⟨p − q, ∇F(q)⟩


Convex duality: Legendre transformation

- For a strictly convex and differentiable function F: X → R:

  F*(y) = sup_{x∈X} { ⟨y, x⟩ − F(x) }, writing l_F(y; x) := ⟨y, x⟩ − F(x)

- Maximum obtained for y = ∇F(x):

  ∇_x l_F(y; x) = y − ∇F(x) = 0 ⇒ y = ∇F(x)

- Maximum unique from the convexity of F (∇²F ≻ 0):

  ∇²_x l_F(y; x) = −∇²F(x) ≺ 0

- Convex conjugates:

  (F, X) ⇔ (F*, Y), Y = {∇F(x) | x ∈ X}


Legendre duality & Canonical divergence

- Convex conjugates have functional inverse gradients:

  (∇F)⁻¹ = ∇F*

  ∇F* may require numerical approximation (not always available in analytical closed form).

- Involution: (F*)* = F.

- Convex conjugate F* expressed using (∇F)⁻¹:

  F*(y) = ⟨(∇F)⁻¹(y), y⟩ − F((∇F)⁻¹(y))

- Fenchel-Young inequality at the heart of the canonical divergence:

  F(x) + F*(y) ≥ ⟨x, y⟩

  A_F(x : y) = A_{F*}(y : x) = F(x) + F*(y) − ⟨x, y⟩ ≥ 0
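A small sketch (my own illustration) of computing F* through the gradient inverse, F*(y) = ⟨(∇F)⁻¹(y), y⟩ − F((∇F)⁻¹(y)), again for the Rayleigh cumulant F(θ) = −log(−2θ), where (∇F)⁻¹ happens to be available in closed form; the grid search only cross-checks the sup definition.

```python
import numpy as np

F = lambda theta: -np.log(-2.0 * theta)    # defined on Theta = (-inf, 0)
gradF = lambda theta: -1.0 / theta         # eta = gradF(theta), in H = (0, +inf)
gradF_inv = lambda eta: -1.0 / eta         # functional inverse: (gradF)^{-1} = grad F*

def F_star(eta):
    theta = gradF_inv(eta)
    return theta * eta - F(theta)          # F*(eta) = <(gradF)^{-1}(eta), eta> - F((gradF)^{-1}(eta))

eta = 3.0
thetas = np.linspace(-20.0, -1e-3, 400_000)
sup_grid = np.max(eta * thetas - F(thetas))      # sup_x {<y, x> - F(x)} on a grid
print(F_star(eta), sup_grid)                     # agree up to the grid resolution

# Fenchel-Young inequality: F(theta) + F*(eta) >= <theta, eta>
theta = -0.7
assert F(theta) + F_star(eta) >= theta * eta
```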


Dual Bregman divergences & canonical divergence [6]

KL(P : Q) = E_P[log (p(x)/q(x))] ≥ 0
          = B_F(θ_Q : θ_P) = B_{F*}(η_P : η_Q)
          = F(θ_Q) + F*(η_P) − ⟨θ_Q, η_P⟩
          = A_F(θ_Q : η_P) = A_{F*}(η_P : θ_Q)

with θ_Q the natural parameterization and η_P = E_P[t(X)] = ∇F(θ_P) the moment parameterization.


Exponential family mixtures: Dual parameterizations

A finite weighted point set {(w_i, θ_i)}_{i=1}^k in a statistical manifold. Many coordinate systems for computing (two canonical):

- the usual λ-parameterization,
- the natural θ-parameterization and its dual η-parameterization.

Exponential family dual parameterization:

λ ∈ Λ (original parameters),
θ ∈ Θ (natural parameters) ↔ η ∈ H (expectation parameters),

linked by the Legendre transform (Θ, F) ↔ (H, F*): η = ∇_θ F(θ), θ = ∇_η F*(η).
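For the multivariate Gaussian of the earlier slide (t(x) = (x, −xx^T)), the three coordinate systems are easy to convert between; a small sketch of my own, using η = E[t(X)] = (µ, −(Σ + µµ^T)):

```python
import numpy as np

def lambda_to_theta(mu, Sigma):
    P = np.linalg.inv(Sigma)
    return P @ mu, 0.5 * P                         # theta = (Sigma^{-1} mu, 1/2 Sigma^{-1})

def theta_to_lambda(theta_v, theta_M):
    Sigma = 0.5 * np.linalg.inv(theta_M)
    return Sigma @ theta_v, Sigma

def lambda_to_eta(mu, Sigma):
    return mu, -(Sigma + np.outer(mu, mu))         # eta = E[t(X)] = (mu, -(Sigma + mu mu^T))

def eta_to_lambda(eta_v, eta_M):
    return eta_v, -eta_M - np.outer(eta_v, eta_v)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(theta_to_lambda(*lambda_to_theta(mu, Sigma)))   # round trips recover (mu, Sigma)
print(eta_to_lambda(*lambda_to_eta(mu, Sigma)))
```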


Maximum Likelihood Estimator (MLE)
Given n identically and independently distributed observations X = {x_1, ..., x_n}, the maximum likelihood estimator is

θ̂ = argmax_{θ∈Θ} ∏_{i=1}^n p_F(x_i; θ) = argmax_{θ∈Θ} e^{Σ_{i=1}^n (⟨t(x_i), θ⟩ − F(θ) + k(x_i))}

The maximum is unique since the Hessian ∇²F ≻ 0:

∇F(θ̂) = (1/n) Σ_{i=1}^n t(x_i)

The MLE is consistent and efficient, with asymptotic normal distribution:

θ̂ ∼ N(θ, (1/n) I⁻¹(θ))

Fisher information matrix:

I(θ) = var[t(X)] = ∇²F(θ)

The MLE may be biased (e.g., normal distributions).
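Concretely, the MLE reduces to averaging the sufficient statistic; a sketch for the Rayleigh family (my own example, using numpy's rayleigh sampler, whose scale parameter is λ):

```python
import numpy as np

rng = np.random.default_rng(1)
lam_true = 2.5
x = rng.rayleigh(scale=lam_true, size=100_000)

eta_hat = np.mean(x**2)            # gradF(theta_hat) = (1/n) sum_i t(x_i), with t(x) = x^2
theta_hat = -1.0 / eta_hat         # theta = grad F*(eta) = -1/eta for the Rayleigh family
lam_hat = np.sqrt(eta_hat / 2.0)   # back to the scale parameterization (eta = 2 lambda^2)

print(lam_hat)                     # close to 2.5
```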

Duality Bregman ↔ Exponential families [2]

(Diagram: the exponential family p_F(x|θ) with cumulant function F(θ) is linked by Legendre duality, η = ∇F(θ), to the Bregman generator F*(η) and the Bregman divergence B_{F*}(x : η).)

An exponential family...

p_F(x; θ) = exp(⟨t(x), θ⟩ − F(θ) + k(x))

has the log-density interpreted as a Bregman divergence:

log p_F(x; θ) = −B_{F*}(t(x) : η) + F*(t(x)) + k(x)


Exponential families ⇔ Bregman divergences: Examples

Generator F(x)   Exponential family   ⇔   Dual Bregman divergence B_{F*}
x²               Spherical Gaussian   ⇔   Squared loss
x log x          Multinomial          ⇔   Kullback-Leibler divergence
x log x − x      Poisson              ⇔   I-divergence
−log(−2x)        Rayleigh             ⇔   Itakura-Saito divergence
−log x           Geometric            ⇔   Itakura-Saito divergence
log |X|          Wishart              ⇔   log-det/Burg matrix divergence [8]


Maximum likelihood estimator revisited

θ̂ = argmax_θ ∏_{i=1}^n p_F(x_i; θ) = argmax_θ Σ_{i=1}^n log p_F(x_i; θ)

  = argmax_θ Σ_{i=1}^n (⟨t(x_i), θ⟩ − F(θ) + k(x_i))

  = argmax_θ Σ_{i=1}^n (−B_{F*}(t(x_i) : η) + F*(t(x_i)) + k(x_i))
                       (the last two terms are constant in θ)

  ≡ argmin_θ Σ_{i=1}^n B_{F*}(t(x_i) : η)

Right-sided Bregman centroid = center of mass: η̂ = (1/n) Σ_{i=1}^n t(x_i).


Bregman batched Lloyd's k-means [2]
Extends Lloyd's k-means heuristic to Bregman divergences.

- Initialize distinct seeds: C_1 = P_1, ..., C_k = P_k
- Repeat until convergence:
  - Assign each point P to its "closest" centroid (w.r.t. B_F(P : C)):
    C_i = {P ∈ P | B_F(P : C_i) ≤ B_F(P : C_j) ∀j ≠ i}
  - Update the cluster centroids by taking their centers of mass:
    C_i = (1/|C_i|) Σ_{P∈C_i} P.

Loss function:

L_F(P : C) = Σ_{P∈P} B_F(P : C), where B_F(P : C) = min_{i∈{1,...,k}} B_F(P : C_i)

...monotonically decreases and converges to a local optimum. (Extend to weighted point sets using barycenters.)
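A compact sketch of this Bregman batched Lloyd loop with a pluggable generator (my own illustration; the F and gradF below are assumptions, here the squared Euclidean generator, which recovers ordinary k-means):

```python
import numpy as np

def bregman_kmeans(points, k, F, gradF, n_iter=50, rng=np.random.default_rng(0)):
    centroids = points[rng.choice(len(points), size=k, replace=False)]   # distinct seeds
    for _ in range(n_iter):                                              # fixed budget for simplicity
        # Assignment: B_F(p : c) = F(p) - F(c) - <p - c, gradF(c)> for all (point, centroid) pairs
        diffs = points[:, None, :] - centroids[None, :, :]               # shape (n, k, d)
        div = (F(points)[:, None] - F(centroids)[None, :]
               - np.einsum('nkd,kd->nk', diffs, gradF(centroids)))
        labels = np.argmin(div, axis=1)
        # Update: the right-sided Bregman centroid is the center of mass, whatever F is
        centroids = np.stack([points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return centroids, labels

F = lambda X: 0.5 * np.sum(X**2, axis=-1)    # squared Euclidean generator
gradF = lambda X: X

rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(5.0, 1.0, (100, 2))])
centroids, labels = bregman_kmeans(pts, 2, F, gradF)
```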


k-MLE for EFMM ≡ Bregman Hard Clustering [4]
Bijection between exponential families (distributions) and Bregman distances:

log p_F(x; θ) = −B_{F*}(t(x) : η) + F*(t(x)) + k(x), with η = ∇F(θ)

Bregman k-MLE for EFMMs (F) = additively weighted Bregman hard k-means for F* in the space {y_i = t(x_i)}_i.

Complete log-likelihood log ∏_{i=1}^n ∏_{j=1}^k (w_j p_F(x_i|θ_j))^{δ_j(z_i)}:

= max_{θ,w} Σ_{i=1}^n Σ_{j=1}^k δ_j(z_i) (log p_F(x_i|θ_j) + log w_j)

= min_{H,w} Σ_{i=1}^n Σ_{j=1}^k δ_j(z_i) ((B_{F*}(t(x_i) : η_j) − log w_j) − k(x_i) − F*(t(x_i)))
                                          (the last two terms are constant)

≡ min_{η,w} Σ_{i=1}^n min_{j=1}^k (B_{F*}(t(x_i) : η_j) − log w_j)

(This is the argmin that gives the z_i's.)

Complete average log-likelihood optimization
Minimize monotonically the complete average log-likelihood:

(1/n) min_{H,w} Σ_{i=1}^n min_{j=1}^k (B_{F*}(t(x_i) : η_j) − log w_j)

- 1. Constant weights → dual additive Bregman k-means:

  (1/n) min_H Σ_{i=1}^n min_{j=1}^k (B_{F*}(t(x_i) : η_j) − log w_j)

- 2. Component moment parameters η fixed:

  min_w Σ_{i=1}^n Σ_{j=1}^k −δ_j(z_i) log w_j = min_w Σ_{j=1}^k −α_j log w_j,

  where α_j = |C_j|/n. That is, minimize the cross-entropy: min_w H×(α : w) ⇒ w = α.

- Go to 1 until (local) convergence is met.


k-MLE-EFMM algorithm [4]

- 0. Initialization: ∀i ∈ {1, ..., k}, let w_i = 1/k and η_i = t(x_i) (initialization is further discussed later on).
- 1. Assignment: ∀i ∈ {1, ..., n}, z_i = argmin_{j=1,...,k} B_{F*}(t(x_i) : η_j) − log w_j.
  Let C_i = {x_j | z_j = i}, ∀i ∈ {1, ..., k}, be the cluster partition: X = ∪_{i=1}^k C_i.
- 2. Update the η-parameters: ∀i ∈ {1, ..., k}, η_i = (1/|C_i|) Σ_{x∈C_i} t(x).
  Go to step 1 unless local convergence of the complete likelihood is reached.
- 3. Update the mixture weights: ∀i ∈ {1, ..., k}, w_i = |C_i|/n.
  Go to step 1 unless local convergence of the complete likelihood is reached.
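A sketch of this loop for a mixture of Rayleigh distributions (my own example; for the Rayleigh family B_{F*} is the Itakura-Saito divergence, cf. the table earlier, and for brevity the η and w updates are interleaved in the same pass, i.e. the hard-EM style variant mentioned on a later slide):

```python
import numpy as np

def itakura_saito(p, q):
    # B_{F*}(p : q) for the Rayleigh family, F*(eta) = -1 + log(2/eta)
    return p / q - np.log(p / q) - 1.0

def kmle_rayleigh(x, k, n_iter=100, rng=np.random.default_rng(0)):
    t = x**2                                                  # sufficient statistic t(x) = x^2
    eta = t[rng.choice(len(t), size=k, replace=False)]        # step 0: eta_i = t(x_i)
    w = np.full(k, 1.0 / k)                                   # step 0: w_i = 1/k
    for _ in range(n_iter):
        # step 1: z_i = argmin_j B_{F*}(t(x_i) : eta_j) - log w_j
        cost = itakura_saito(t[:, None], eta[None, :]) - np.log(w)[None, :]
        z = np.argmin(cost, axis=1)
        # step 2: eta_j = average of t(x) over cluster C_j (per-cluster MLE)
        eta = np.array([t[z == j].mean() if np.any(z == j) else eta[j] for j in range(k)])
        # step 3: w_j = |C_j| / n  (tiny floor guards against empty clusters)
        w = np.maximum(np.bincount(z, minlength=k) / len(x), 1e-12)
    return w, np.sqrt(eta / 2.0)                              # weights and scale parameters lambda_j

rng = np.random.default_rng(2)
x = np.concatenate([rng.rayleigh(1.0, 500), rng.rayleigh(4.0, 500)])
print(kmle_rayleigh(x, 2))
```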


k-MLE initialization

- Forgy's random seeding (d = D),
- Bregman k-means (for F* on Y, and the MLE on each cluster).

Usually D > d (e.g., multivariate Gaussians: D = d(d+3)/2).

- Compute the global MLE η̄ = (1/n) Σ_{i=1}^n t(x_i) (well-defined for n ≥ D → θ̄ ∈ Θ).
- Consider the restricted exponential family F_{θ^{(d+1...D)}}(θ^{(1...d)}), then set η_i^{(1...d)} = t^{(1...d)}(x_i) and η_i^{(d+1...D)} = η̄^{(d+1...D)}.
  (e.g., for Gaussians, we fix the global covariance matrix and let µ_i = x_i)
- Improve the initialization by applying Bregman k-means++ [1] for the convex conjugate of F_{θ^{(d+1...D)}}(θ^{(1...d)}).

k-MLE++ is based on Bregman k-means++ (a seeding sketch follows below).
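A sketch of the k-means++-style seeding that k-MLE++ builds on (my own illustration, assuming a generic divergence function div(points, seed)): the first seed is drawn uniformly, and each further seed is drawn with probability proportional to its divergence to the closest seed already chosen.

```python
import numpy as np

def bregman_kmeanspp_seeds(points, k, div, rng=np.random.default_rng(0)):
    seeds = [points[rng.integers(len(points))]]                 # first seed: uniform pick
    for _ in range(k - 1):
        d = np.min(np.array([div(points, s) for s in seeds]), axis=0)   # divergence to closest seed
        seeds.append(points[rng.choice(len(points), p=d / d.sum())])    # divergence-weighted pick
    return np.array(seeds)

# Example with the squared Euclidean Bregman divergence (i.e., ordinary k-means++ seeding)
sq_euclid = lambda P, s: np.sum((P - s)**2, axis=-1)
pts = np.random.default_rng(1).normal(size=(500, 2))
seeds = bregman_kmeanspp_seeds(pts, 3, sq_euclid)
```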


k-MLE variants using any Bregman k-means heuristic

- Any k-means optimization heuristic allows one to update the mixture η-parameters.
- Hartigan & Wang's greedy swap (after Lloyd convergence)
- Kanungo et al. swap ((9 + ε)-approximation)

- Updating the mixture η and w parameters successively yields a hard-EM variant (easily implemented by winner-take-all EM weight membership).


k-MLE for MVNs with the (µ,Σ) parameters

- 0. Initialization:
  - Calculate the global mean µ̄ and global covariance matrix Σ̄:
    µ̄ = (1/n) Σ_{i=1}^n x_i, Σ̄ = (1/n) Σ_{i=1}^n x_i x_i^T − µ̄µ̄^T
  - ∀i ∈ {1, ..., k}, initialize the i-th seed as (µ_i = x_i, Σ_i = Σ̄).

- 1. Assignment: ∀i ∈ {1, ..., n},
  z_i = argmin_{j=1,...,k} M_{Σ_j⁻¹}(x_i − µ_j, x_i − µ_j) + log|Σ_j| − 2 log w_j,
  with M_Q the squared Mahalanobis distance M_Q(x, y) = (x − y)^T Q (x − y).
  Let C_i = {x_j | z_j = i}, ∀i ∈ {1, ..., k}, be the cluster partition: X = ∪_{i=1}^k C_i.
- 2. Update the parameters: ∀i ∈ {1, ..., k},
  µ_i = (1/|C_i|) Σ_{x∈C_i} x, Σ_i = (1/|C_i|) Σ_{x∈C_i} xx^T − µ_iµ_i^T.
  Go to step 1 unless local convergence.
- 3. Update the mixture weights: ∀i ∈ {1, ..., k}, w_i = |C_i|/n.
  Go to step 1 unless local convergence.
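A runnable sketch of these steps (my own implementation of the loop listed above, in the (µ, Σ) parameterization):

```python
import numpy as np

def kmle_mvn(X, k, n_iter=50, rng=np.random.default_rng(0)):
    n, d = X.shape
    Sigma_glob = np.cov(X, rowvar=False, bias=True)               # (1/n) sum x x^T - mean mean^T
    mus = X[rng.choice(n, size=k, replace=False)].copy()          # 0. mu_i = x_i, Sigma_i = global cov
    Sigmas = np.array([Sigma_glob.copy() for _ in range(k)])
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # 1. assignment: argmin_j Mahalanobis_{Sigma_j^{-1}}(x - mu_j) + log|Sigma_j| - 2 log w_j
        cost = np.empty((n, k))
        for j in range(k):
            diff = X - mus[j]
            maha = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigmas[j]), diff)
            cost[:, j] = maha + np.linalg.slogdet(Sigmas[j])[1] - 2.0 * np.log(w[j])
        z = np.argmin(cost, axis=1)
        # 2. per-cluster MLE of (mu_j, Sigma_j); 3. weight update w_j = |C_j| / n
        for j in range(k):
            C = X[z == j]
            if len(C) > d:                                        # keep Sigma_j well-defined
                mus[j] = C.mean(axis=0)
                Sigmas[j] = np.cov(C, rowvar=False, bias=True)
        w = np.maximum(np.bincount(z, minlength=k) / n, 1e-12)    # floor avoids log(0)
    return w, mus, Sigmas

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, (300, 2)), rng.normal(5.0, 1.0, (300, 2))])
w, mus, Sigmas = kmle_mvn(X, 2)
```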


Summary of contributions

- Hard k-MLE versus soft EM:
  - k-MLE maximizes locally the complete likelihood
  - EM maximizes the incomplete likelihood
- The component parameter η update can be implemented using any Bregman k-means heuristic on the conjugate F*,
- Initialization can be performed using k-MLE++
- Indivisibility: robustness when identifying statistical mixture models? Which k?
  ∀k ∈ N, N(µ, σ²) = Σ_{i=1}^k N(µ/k, σ²/k)

Simplifying mixtures from kernel density estimators is one fine-to-coarse solution. See: Model centroids for the simplification of kernel density estimators, ICASSP 2012, March 29th.


References

[1] Marcel R. Ackermann and Johannes Blömer. Bregman clustering for separable instances. In Scandinavian Workshop on Algorithm Theory (SWAT), pages 212-223, 2010.

[2] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705-1749, 2005.

[3] Vincent Garcia and Frank Nielsen. Simplification and hierarchical representations of mixtures of exponential families. Signal Processing (Elsevier), 90(12):3197-3212, 2010.

[4] Frank Nielsen. k-MLE: A fast algorithm for learning statistical mixture models. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2012. Preliminary technical report on arXiv.

[5] Frank Nielsen and Vincent Garcia. Statistical exponential families: A digest with flash cards, 2009. arXiv:0911.4863.

[6] Frank Nielsen and Richard Nock. Entropies and cross-entropies of exponential families. In International Conference on Image Processing (ICIP), pages 3621-3624, 2010.

[7] Jose Seabra, Francesco Ciompi, Oriol Pujol, Josepa Mauri, Petia Radeva, and Joao Sanchez. Rayleigh mixture model for plaque characterization in intravascular ultrasound. IEEE Transactions on Biomedical Engineering, 58(5):1314-1324, 2011.

[8] Shijun Wang and Rong Jin. An information geometry approach for distance metric learning. Journal of Machine Learning Research, 5:591-598, 2009.


Anisotropic Voronoi diagram (for MVN MMs)
From the source color image (a), we build a 5D GMM with k = 32 components, and color each pixel with the mean color of the anisotropic Voronoi cell it belongs to.


Speed up the assignment step using Bregman ball trees or Bregman vantage point trees.


Expectation-maximization (EM) for EFMMs [2]
EM increases monotonically the expected complete likelihood (marginalize):

Σ_{i=1}^n Σ_{j=1}^k p(z_j | x_i, θ) log p(x_i, z_j | θ)

Banerjee et al. [2] proved it amounts to a Bregman soft clustering.


Comparisons: k-MLE vs. EM for EFMMs

         k-MLE / Hard EM               Soft EM (1977)
         = Bregman hard clustering     = Bregman soft clustering
Memory   lighter                       heavier (W matrix)
Speed    lighter (VP-tree)             heavier (all weights w_ij)
Conv.    always, finitely              ∞, stopping criterion
Init.    k-MLE++                       k-means(++)

