
Minimum Probability Flow Learning

Jascha Sohl-Dickstein a,b,* [email protected]
Peter Battaglino a,c,* [email protected]
Michael R. DeWeese a,c,d [email protected]

a Redwood Center for Theoretical Neuroscience, b Biophysics Graduate Group, c Physics Department, d Helen Wills Neuroscience Institute, University of California, Berkeley, 94720
* These authors contributed equally.

Abstract

Fitting probabilistic models to data is often difficult, due to the general intractability of the partition function and its derivatives. Here we propose a new parameter estimation technique that does not require computing an intractable normalization factor or sampling from the equilibrium distribution of the model. This is achieved by establishing dynamics that would transform the observed data distribution into the model distribution, and then setting as the objective the minimization of the KL divergence between the data distribution and the distribution produced by running the dynamics for an infinitesimal time. Score matching, minimum velocity learning, and certain forms of contrastive divergence are shown to be special cases of this learning technique. We demonstrate parameter estimation in Ising models, deep belief networks and an independent component analysis model of natural scenes. In the Ising model case, current state-of-the-art techniques are outperformed by at least an order of magnitude in learning time, with lower error in recovered coupling parameters.

1. Introduction

Estimating parameters for probabilistic models is a fundamental problem in many scientific and engineering disciplines. Unfortunately, most probabilistic learning techniques require calculating the normalization factor, or partition function, of the probabilistic model in question, or at least calculating its gradient. For the overwhelming majority of models there

Appearing in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).

are no known analytic solutions. Thus, development of powerful new techniques for parameter estimation promises to greatly expand the variety of models that can be fit to complex data sets.

Many approaches exist for approximate learning, including mean field theory and its expansions, variational Bayes techniques and a variety of sampling or numerical integration based methods (Tanaka, 1998; Kappen & Rodríguez, 1997; Jaakkola & Jordan, 1997; Haykin, 2008). Of particular interest are contrastive divergence (CD), developed by Hinton, Welling and Carreira-Perpiñán (Welling & Hinton, 2002; Carreira-Perpiñán & Hinton, 2004), Hyvärinen's score matching (SM) (Hyvärinen, 2005), Besag's pseudolikelihood (PL) (Besag, 1975), and the minimum velocity learning framework proposed by Movellan (Movellan, 2008a;b; Movellan & McClelland, 1993).

Contrastive divergence (Welling & Hinton, 2002; Carreira-Perpiñán & Hinton, 2004) is a variation on steepest gradient descent of the maximum (log) likelihood (ML) objective function. Rather than integrating over the full model distribution, CD approximates the partition function term in the gradient by averaging over the distribution obtained after taking a few, or only one, Markov chain Monte Carlo (MCMC) steps away from the data distribution (Equation 17). Qualitatively, one can imagine that the data distribution is contrasted against a distribution that has evolved only a small distance towards the model distribution, whereas it would be contrasted against the true model distribution in traditional MCMC approaches. Although CD is not guaranteed to converge to the right answer, or even to a fixed point, it has proven to be an effective and fast heuristic for parameter estimation (MacKay, 2001; Yuille, 2005).

Score matching (Hyvärinen, 2005) is a method that learns parameters in a probabilistic model using only derivatives of the energy function evaluated over the data distribution (see Equation (19)).


[Figure 1 here; panels titled "Progression of Learning".]

Figure 1. An illustration of parameter estimation using minimum probability flow (MPF). In each panel, the axes represent the space of all probability distributions. The three successive panels illustrate the sequence of parameter updates that occur during learning. The dashed red curves indicate the family of model distributions p^(∞)(θ) parametrized by θ. The black curves indicate deterministic dynamics that transform the data distribution p^(0) into the model distribution p^(∞)(θ). Under maximum likelihood learning, model parameters θ are chosen so as to minimize the Kullback–Leibler (KL) divergence between the data distribution p^(0) and the model distribution p^(∞)(θ). Under MPF, however, the KL divergence between p^(0) and p^(ε) is minimized instead, where p^(ε) is the distribution obtained by initializing the dynamics at the data distribution p^(0) and then evolving them for an infinitesimal time ε. Here we represent graphically how parameter updates that pull p^(ε) towards p^(0) also tend to pull p^(∞)(θ) towards p^(0).

This sidesteps the need to explicitly sample or integrate over the model distribution. In score matching one minimizes the expected square distance of the score function with respect to spatial coordinates given by the data distribution from the similar score function given by the model distribution. A number of connections have been made between score matching and other learning techniques (Hyvärinen, 2007; Sohl-Dickstein & Olshausen, 2009; Movellan, 2008a; Lyu, 2009).

Pseudolikelihood (Besag, 1975) approximates the joint probability distribution of a collection of random variables with a computationally tractable product of conditional distributions, where each factor is the distribution of a single random variable conditioned on the others. This approach often leads to surprisingly good parameter estimates, despite the extreme nature of the approximation.

Minimum velocity learning is an approach recently proposed by Movellan (Movellan, 2008a) that recasts a number of the ideas behind CD, treating the minimization of the initial dynamics away from the data distribution as the goal itself rather than a surrogate for it. Rather than directly minimize the difference between the data and the model, Movellan's proposal is to introduce system dynamics that have the model as their equilibrium distribution, and minimize the initial flow of probability away from the data under those dynamics. If the model looks exactly like the data there will be no flow of probability, and if model and data are similar the flow of probability will tend to be minimal. Movellan applies this intuition to the specific case of distributions over continuous state spaces evolving via diffusion dynamics, and recovers the score matching objective function.

Two additional recent techniques deserve mention. Minimum KL contraction (Lyu, 2011) involves applying a contraction mapping to both data and model distributions, and minimizing the amount by which this contraction mapping shrinks the KL divergence between data and model distributions. Like minimum probability flow, it appears to be a generalization of a number of existing parameter estimation techniques based on "local" information about the model distribution. Noise contrastive estimation (Gutmann & Hyvärinen, 2010) estimates model parameters and the partition function by training a classifier to distinguish between the data distribution and a noise distribution carefully chosen to resemble the data distribution.

Here we propose a consistent parameter estimation framework called minimum probability flow learning (MPF), applicable to any parametric model without latent variables. Minimum velocity learning, SM and certain forms of CD are all special cases of MPF, which is in many situations more powerful than any of these other algorithms. We demonstrate that learning under this framework is effective and fast in a number of cases: Ising models (Brush, 1967; Ackley et al., 1985), deep belief networks (Hinton et al., 2006), and independent component analysis (Bell & Sejnowski, 1995).

2. Minimum Probability Flow

Our goal is to find the parameters that cause a probabilistic model to best agree with a list D of (assumed iid) observations of the state of a system. We will do this by introducing deterministic dynamics that guarantee the transformation of the data distribution into the model distribution, and then minimizing the KL divergence between the data distribution and the distribution that results from running those dynamics for a short time ε (see Figure 1).

2.1. Distributions

The data distribution is represented by a vector p^(0), with p_i^(0) the fraction of the observations D in state i. The superscript (0) represents time t = 0 under the system dynamics (which will be described in more detail in Section 2.2). For example, in a two variable binary system, p^(0) would have four entries representing the fraction of the data in states 00, 01, 10 and 11 (Figure 2).

[Figure 2 here; three panels titled "data distribution", "dynamics", and "model distribution", each a histogram over the four states 00, 01, 10, 11.]

Figure 2. Dynamics of minimum probability flow learning. Model dynamics, represented by the probability flow matrix Γ (middle), determine how probability flows from the empirical histogram of the sample data points (left) to the equilibrium distribution of the model (right) after a sufficiently long time. In this example there are only four possible states for the system, which consists of a pair of binary variables, and the particular model parameters favor state 10 whereas the data falls on other states.

Our goal is to find the parameters θ that cause a model distribution p^(∞)(θ) to best match the data distribution p^(0). The superscript (∞) on the model distribution indicates that this is the equilibrium distribution reached after running the dynamics for infinite time. Without loss of generality, we assume the model distribution is of the form

$$p_i^{(\infty)}(\theta) = \frac{\exp\left(-E_i(\theta)\right)}{Z(\theta)}, \qquad (1)$$

where E(θ) is referred to as the energy function, and the normalizing factor Z(θ) is the partition function,

$$Z(\theta) = \sum_i \exp\left(-E_i(\theta)\right) \qquad (2)$$

(this can be thought of as a Boltzmann distribution of a physical system with k_B T set to 1).

2.2. Dynamics

Most Monte Carlo algorithms rely on two core concepts from statistical physics, the first being conservation of probability as enforced by the master equation for the time evolution of a distribution p^(t) (Pathria, 1972):

$$\dot{p}_i^{(t)} = \sum_{j \neq i} \Gamma_{ij}(\theta)\, p_j^{(t)} - \sum_{j \neq i} \Gamma_{ji}(\theta)\, p_i^{(t)}, \qquad (3)$$

where $\dot{p}_i^{(t)}$ is the time derivative of $p_i^{(t)}$. Transition rates Γ_ij(θ), for i ≠ j, give the rate at which probability flows from a state j into a state i. The first term of Equation (3) captures the flow of probability out of other states j into the state i, and the second captures flow out of i into other states j. The dependence on θ results from the requirement that the chosen dynamics cause p^(t) to flow to the equilibrium distribution p^(∞)(θ). For readability, explicit dependence on θ will be dropped except where necessary. If we choose to set the diagonal elements of Γ to obey $\Gamma_{ii} = -\sum_{j \neq i} \Gamma_{ji}$, then we can write the dynamics as

$$\dot{p}^{(t)} = \Gamma\, p^{(t)} \qquad (4)$$

(see Figure 2). The unique solution for p^(t) is given by¹

$$p^{(t)} = \exp\left(\Gamma t\right) p^{(0)}, \qquad (5)$$

where exp(Γt) is a matrix exponential.

2.3. Detailed Balance

The second core concept is detailed balance,

Γji p(∞)i (θ) = Γij p

(∞)j (θ) , (6)

which states that at equilibrium the probability flowfrom state i into state j equals the probability flowfrom j into i. When satisfied, detailed balance guar-antees that the distribution p(∞) (θ) is a fixed point ofthe dynamics. Sampling in most Monte Carlo meth-ods is performed by choosing Γ consistent with Equa-tion 6 (and the added requirement of ergodicity), thenstochastically running the dynamics of Equation 3.Note that there is no need to restrict the dynamics de-fined by Γ to those of any real physical process, suchas diffusion.

Equation 6 can be written in terms of the model’s en-ergy function E (θ) by substituting in Equation 1 forp(∞) (θ):

Γji exp (−Ei (θ)) = Γij exp (−Ej (θ)) . (7)

Γ is underconstrained by the above equation. Intro-ducing the additional constraint that Γ be invariantto the addition of a constant to the energy function(as the model distribution p(∞) (θ) is), we choose thefollowing form for the non-diagonal entries in Γ

Γij = gij exp

[1

2(Ej (θ)− Ei (θ))

](i 6= j) , (8)

1 The form chosen for Γ in Equation (4), coupled withthe satisfaction of detailed balance and ergodicity intro-duced in section 2.3, guarantees that there is a uniqueeigenvector p(∞) of Γ with eigenvalue zero, and that allother eigenvalues of Γ have negative real parts.


where the connectivity function

$$g_{ij} = g_{ji} = \begin{cases} 0 & \text{unconnected states} \\ 1 & \text{connected states} \end{cases} \quad (i \neq j) \qquad (9)$$

determines which states are allowed to directly exchange probability with each other². g_ij can be set such that Γ is extremely sparse (see Section 2.5). Theoretically, to guarantee convergence to the model distribution, the non-zero elements of Γ must be chosen such that, given sufficient time, probability can flow between any pair of states (ergodicity).
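To make the flow concrete, the following minimal numerical sketch (with hypothetical energies; this is not the authors' released Matlab code) builds Γ for a four state system using Equations (8) and (9) with full connectivity, and confirms that the dynamics of Equation (5) relax the data distribution to the model distribution of Equation (1):

```python
import numpy as np
from scipy.linalg import expm

E = np.array([1.0, 2.0, 0.1, 1.5])           # hypothetical energies E_i(theta)
p_inf = np.exp(-E) / np.exp(-E).sum()        # model distribution, Equation (1)

# Off-diagonal rates Gamma_ij = g_ij exp[(E_j - E_i)/2], Equation (8), with
# full connectivity g_ij = 1; diagonal set so each column sums to zero.
Gamma = np.exp(0.5 * (E[None, :] - E[:, None]))
np.fill_diagonal(Gamma, 0.0)
np.fill_diagonal(Gamma, -Gamma.sum(axis=0))  # Gamma_ii = -sum_{j != i} Gamma_ji

p0 = np.array([0.5, 0.25, 0.0, 0.25])        # data distribution p^(0)
p_t = expm(Gamma * 50.0) @ p0                # Equation (5), run for t = 50
print(np.round(p_t, 6), np.round(p_inf, 6))  # the two agree closely
```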

2.4. Objective Function

Maximum likelihood parameter estimation involves maximizing the likelihood of some observations D under a model, or equivalently minimizing the KL divergence between the data distribution p^(0) and model distribution p^(∞),

$$\hat{\theta}_{ML} = \operatorname*{argmin}_{\theta} D_{KL}\left(p^{(0)} \,\Big\|\, p^{(\infty)}(\theta)\right). \qquad (10)$$

Rather than running the dynamics for infinite time, we propose to minimize the KL divergence after running the dynamics for an infinitesimal time ε,

$$\hat{\theta}_{MPF} = \operatorname*{argmin}_{\theta} K(\theta) \qquad (11)$$

$$K(\theta) = D_{KL}\left(p^{(0)} \,\Big\|\, p^{(\varepsilon)}(\theta)\right). \qquad (12)$$

For small ε, $D_{KL}\left(p^{(0)} \,\|\, p^{(\varepsilon)}(\theta)\right)$ can be approximated by a first order Taylor expansion,

$$K(\theta) \approx D_{KL}\left(p^{(0)} \,\Big\|\, p^{(t)}(\theta)\right)\Big|_{t=0} + \varepsilon\, \frac{\partial D_{KL}\left(p^{(0)} \,\big\|\, p^{(t)}(\theta)\right)}{\partial t}\Bigg|_{t=0}. \qquad (13)$$

Further algebra (see Appendix A) reduces K(θ) to a measure of the flow of probability, at time t = 0 under the dynamics, out of data states j ∈ D into non-data states i ∉ D,

$$K(\theta) = \frac{\varepsilon}{|D|} \sum_{i \notin D} \sum_{j \in D} \Gamma_{ij} \qquad (14)$$

$$= \frac{\varepsilon}{|D|} \sum_{j \in D} \sum_{i \notin D} g_{ij} \exp\left[\frac{1}{2}\left(E_j(\theta) - E_i(\theta)\right)\right] \qquad (15)$$

² The non-zero Γ may also be sampled from a proposal distribution rather than set via a deterministic scheme, in which case g_ij takes on the role of a proposal distribution; see Appendix D.

with gradient

∂K (θ)

∂θ=

ε

|D|∑

j∈D

i/∈D

[∂Ej (θ)

∂θ− ∂Ei (θ)

∂θ

]

gij exp

[1

2(Ej (θ)− Ei (θ))

], (16)

where |D| is the number of observed data points. Note that Equations (14) and (16) do not depend on the partition function Z(θ) or its derivatives.

K(θ) is uniquely zero when p^(0) and p^(∞)(θ) are equal. This implies consistency, in that if the data comes from the model class, then in the limit of infinite data K(θ) will be minimized by exactly the right θ. In addition, K(θ) is convex for all models p^(∞)(θ) in the exponential family, that is, models whose energy functions E(θ) are linear in their parameters θ (Macke & Gerwinn, 2009) (see Appendix B).
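As a concrete illustration, the following sketch (toy states and features; all names are ours, not the paper's) evaluates Equations (15) and (16) for a small exponential-family model with energies E_i(θ) = θ · f_i and full connectivity:

```python
import numpy as np

f = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # feature vector f_i per state
data = [0, 0, 1, 3]                                     # observed states D (with repeats)

def K_and_grad(theta, eps=1.0):
    E = f @ theta                            # energies E_i(theta); dE_i/dtheta = f_i
    non_data = set(range(len(E))) - set(data)
    K, grad = 0.0, np.zeros_like(theta)
    for j in data:                           # flow out of each data state j ...
        for i in non_data:                   # ... into each non-data state i
            w = np.exp(0.5 * (E[j] - E[i]))  # rate weight of Equation (8), g_ij = 1
            K += w
            grad += w * (f[j] - f[i])        # summand of Equation (16)
    return eps * K / len(data), eps * grad / len(data)

print(K_and_grad(np.array([0.1, -0.2])))
```

Because both objective and gradient are available in closed form, they drop directly into an off-the-shelf optimizer such as scipy.optimize.minimize, mirroring the paper's use of minFunc.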

2.5. Tractability

The dimensionality of the vector p^(0) is typically huge, as is that of Γ (e.g., 2^d and 2^d × 2^d, respectively, for a d-bit binary system). Naïvely, this would seem to prohibit evaluation and minimization of the objective function. Fortunately, we need only visit those columns of Γ corresponding to data states, j ∈ D. Additionally, g_ij can be populated so as to connect each state j to only a small fixed number of additional states i. The cost in both memory and time to evaluate the objective function is thus O(|D|), and does not depend on the number of system states, only on the (much smaller) number of observed data points.

2.6. Continuous State Spaces

Although we have motivated this technique using systems with a large, but finite, number of states, it generalizes to continuous state spaces. Γ_ji, g_ji, and p_i^(t) become continuous functions Γ(x_j, x_i), g(x_j, x_i), and p^(t)(x_i). Γ(x_j, x_i) can be populated stochastically and extremely sparsely (see Appendix D), preserving the O(|D|) cost. A specific scheme (similar to CD with Hamiltonian Monte Carlo) for estimating parameters in a continuous state space via MPF is described in Appendix E.

2.7. Choosing the Connectivity Function g

Qualitatively, the most informative states to connect data states to are those that are most probable under the model. In discrete state spaces, nearest neighbor connectivity schemes for g_ji work extremely well (e.g., Equation 21 below). This is because, as learning converges, the states that are near data states become the states that are probable under the model.

In continuous state spaces, the estimated parameters are much more sensitive to the choice of g(x_j, x_i). One effective form for g(x_j, x_i) is described in Appendix E, but theory supporting different choices of g(x_j, x_i) remains an area of active exploration.

3. Connection to Other Learning Techniques

3.1. Contrastive Divergence

The contrastive divergence update rule can be written in the form

$$\Delta\theta_{CD} \propto -\sum_{j \in D} \sum_{i \notin D} \left[\frac{\partial E_j(\theta)}{\partial \theta} - \frac{\partial E_i(\theta)}{\partial \theta}\right] T_{ij}, \qquad (17)$$

where T_ij is the probability of transitioning from state j to state i in a single Markov chain Monte Carlo step (or k steps for CD-k). Equation 17 has obvious similarities to the MPF learning gradient in Equation 16. Thus, steepest gradient descent under MPF resembles CD updates, but with the MCMC sampling/rejection step T_ij replaced by a weighting factor $g_{ij} \exp\left[\frac{1}{2}\left(E_j(\theta) - E_i(\theta)\right)\right]$.

Note that this difference in form provides MPF with a well-defined objective function. One important consequence of the existence of an objective function is that MPF can readily utilize general purpose, off-the-shelf optimization packages for gradient descent, which would have to be tailored in some way to be applied to CD. This is part of what accounts for the dramatic difference in learning time between CD and MPF in some cases (see Fig. 3).

3.2. Score Matching

For a continuous state space, MPF reduces to score matching if the connectivity function g(x_j, x_i) is set to connect all states within a small distance r of each other,

$$g(x_i, x_j) = g(x_j, x_i) = \begin{cases} 0 & d(x_i, x_j) > r \\ 1 & d(x_i, x_j) \leq r, \end{cases} \qquad (18)$$

where d(x_i, x_j) is the Euclidean distance between states x_i and x_j. In the limit as r goes to 0 (within an overall constant and scaling factor),

$$\lim_{r \to 0} K(\theta) \sim K_{SM}(\theta) = \sum_{x \in D} \left[\frac{1}{2} \nabla E(x) \cdot \nabla E(x) - \nabla^2 E(x)\right], \qquad (19)$$

where K_SM(θ) is the SM objective function (see Appendix C). Unlike SM, MPF is applicable to any parametric model, including discrete systems, and it does not require evaluating a third order derivative, which can result in unwieldy expressions.

4. Experimental Results

Matlab code implementing MPF for several cases is available at https://github.com/Sohl-Dickstein/Minimum-Probability-Flow-Learning.

All minimization was performed using minFunc (Schmidt, 2005).

4.1. Ising Model

The Ising model has a long and storied history in physics (Brush, 1967) and machine learning (Ackley et al., 1985), and it has recently been found to be a surprisingly useful model for networks of neurons in the retina (Schneidman et al., 2006; Shlens et al., 2006).

We estimated parameters for an Ising model (sometimes referred to as a fully visible Boltzmann machine or an Ising spin glass) of the form

$$p^{(\infty)}(\mathbf{x}; J) = \frac{1}{Z(J)} \exp\left[-\mathbf{x}^T J\, \mathbf{x}\right], \qquad (20)$$

where the coupling matrix J only had non-zero elements corresponding to nearest-neighbor units in a two-dimensional square lattice, and bias terms along the diagonal. The training data D consisted of 20,000 d-element iid binary samples x ∈ {0,1}^d generated via Swendsen-Wang sampling (Swendsen & Wang, 1987) from a spin glass with known coupling parameters. We used a square 10 × 10 lattice, d = 10². The non-diagonal nearest-neighbor elements of J were set using draws from a normal distribution with variance σ² = 10. The diagonal (bias) elements of J were set in such a way that each column of J summed to 0, so that the expected unit activations were 0.5. The transition matrix Γ had 2^d × 2^d elements, but for learning we populated it sparsely, setting

$$g_{ij} = g_{ji} = \begin{cases} 1 & \text{states } i, j \text{ differ by single bit flip} \\ 0 & \text{otherwise.} \end{cases} \qquad (21)$$
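For concreteness, here is a minimal sketch (our own notation, not the paper's released Matlab code) of the MPF objective of Equation (15) for this Ising model, using the single-bit-flip connectivity of Equation (21):

```python
import numpy as np

# X: (n_samples, d) array of {0,1} data; J: (d, d) symmetric coupling matrix.
def mpf_ising_objective(J, X):
    # For E(x) = x^T J x, flipping bit k changes the energy by
    #   dE_k = 2 s_k (J x)_k + J_kk,   where s_k = 1 - 2 x_k,
    # computed here for every sample and every bit at once.
    s = 1.0 - 2.0 * X
    dE = 2.0 * s * (X @ J) + np.diag(J)
    # Equation (15): sum over data states j and their bit-flip neighbors i of
    # exp[(E_j - E_i)/2] = exp(-dE/2), averaged over the |D| data points.
    return np.exp(-0.5 * dE).sum() / X.shape[0]
```

Equation (16) gives the matching analytic gradient (or it can be obtained by automatic differentiation), so this objective can be handed to a quasi-Newton optimizer such as the minFunc routine used in the paper.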

Figure 3 shows the mean square error in the estimated J and the mean square error in the corresponding pairwise correlations as a function of learning time for MPF and four competing approaches: mean field theory with TAP corrections (Tanaka, 1998), CD with both one and ten sampling steps per iteration, and pseudolikelihood.

[Figure 3 here: six panels of mean square error (log scale) versus learning time in seconds; panels (a)–(c) show mean square coupling error and panels (d)–(f) mean square correlation error, for MPF, MFT+TAP, CD1, CD10, and Pseudolikelihood.]

Figure 3. A demonstration of Minimum Probability Flow (MPF) outperforming existing techniques for parameter recovery in an Ising model. (a) Time evolution of the mean square error in the coupling strengths for 5 methods for the first 60 seconds of learning. Note that mean field theory with second order corrections (MFT+TAP) actually increases the error above random parameter assignments in this case. (b) Mean square error in the coupling strengths for the first 800 seconds of learning. (c) Mean square error in coupling strengths for the entire learning period. (d)–(f) Mean square error in pairwise correlations for the first 60 seconds of learning, the first 800 seconds of learning, and the entire learning period, respectively. In every comparison above MPF finds a better fit, and for all cases but MFT+TAP does so in a shorter time (see Table 1).

Using MPF, learning took approximately 60 seconds, compared to roughly 800 seconds for pseudolikelihood and upwards of 20,000 seconds for 1-step and 10-step CD. Note that given sufficient training samples, MPF would converge exactly to the right answer, as learning in the Ising model is convex (see Appendix B), and has its global minimum at the true solution. Table 1 shows the relative performance at convergence in terms of mean square error in recovered weights, mean square error in the resulting model's correlation function, and convergence time. MPF was dramatically faster to converge than any of the other methods tested, with the exception of MFT+TAP, which failed to find reasonable parameters. MPF fit the model to the data substantially better than any of the other methods.

Table 1. Mean square error in recovered coupling strengths (ε_J), mean square error in pairwise correlations (ε_corr) and learning time for MPF versus mean field theory with TAP correction (MFT+TAP), 1-step and 10-step contrastive divergence (CD-1 and CD-10), and pseudolikelihood (PL).

Technique   ε_J      ε_corr   Time (s)
MPF         0.0172   0.0025   ~60
MFT+TAP     7.7704   0.0983   0.1
CD-1        0.3196   0.0127   ~20000
CD-10       0.3341   0.0123   ~20000
PL          0.0582   0.0036   ~800

4.2. Deep Belief Network

As a demonstration of learning on a more complex discrete valued model, we trained a 4 layer deep belief network (DBN) (Hinton et al., 2006) on MNIST handwritten digits. A DBN consists of stacked restricted Boltzmann machines (RBMs), such that the hidden layer of one RBM forms the visible layer of the next. Each RBM has the form

$$p^{(\infty)}\left(\mathbf{x}_{\mathrm{vis}}, \mathbf{x}_{\mathrm{hid}}; W\right) = \frac{\exp\left[\mathbf{x}_{\mathrm{hid}}^T W\, \mathbf{x}_{\mathrm{vis}}\right]}{Z(W)}, \qquad (22)$$

$$p^{(\infty)}\left(\mathbf{x}_{\mathrm{vis}}; W\right) = \frac{\exp\left[\sum_k \log\left(1 + \exp\left[W_k\, \mathbf{x}_{\mathrm{vis}}\right]\right)\right]}{Z(W)}. \qquad (23)$$

Sampling-free application of MPF requires analytically marginalizing over the hidden units. RBMs were trained in sequence, starting at the bottom layer, on 10,000 samples from the MNIST postal hand written digits data set. As in the Ising case, the transition matrix Γ was populated so as to connect every state to all states that differed by only a single bit flip (Equation 21). Training was performed by both MPF and single step CD (note that CD turns into full ML learning as the number of steps is increased, and that many step CD would have produced a superior, more computationally expensive, answer).
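As an illustration of this sampling-free marginalization, here is a minimal sketch (our notation, not the released code) of the MPF objective for a single RBM layer, using the marginal energy E(x_vis) = −∑_k log(1 + exp(W_k x_vis)) implied by Equation (23) and the bit-flip connectivity of Equation (21):

```python
import numpy as np

def softplus(a):
    return np.logaddexp(0.0, a)                  # log(1 + exp(a)), numerically stable

# X: (n_samples, d_vis) binary data as floats; W: (d_hid, d_vis) weights.
def mpf_rbm_objective(W, X):
    def energy(V):
        # E(x_vis) = -sum_k log(1 + exp(W_k x_vis)), hidden units marginalized out
        return -softplus(V @ W.T).sum(axis=-1)
    E_data = energy(X)                           # (n_samples,)
    K = 0.0
    for k in range(X.shape[1]):                  # flip each visible bit in turn
        X_flip = X.copy()
        X_flip[:, k] = 1.0 - X_flip[:, k]
        # Equation (15) summand: exp[(E_data - E_flipped) / 2]
        K += np.exp(0.5 * (E_data - energy(X_flip))).sum()
    return K / X.shape[0]
```

This recomputes the full energy for each flip for clarity; a practical implementation would update the hidden-unit activations incrementally.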

Confabulations were generated by Gibbs sampling from the top layer RBM, then propagating each sample back down to the pixel layer by way of the conditional distribution p^(∞)(x_vis | x_hid; W_k) for each of the intermediary RBMs, where k indexes the layer in the stack. 1,000 sampling steps were taken between each confabulation. As shown in Figure 4, MPF learned a good model of handwritten digits.

4.3. Independent Component Analysis

[Figure 4 here: (a) a schematic of the network, 28×28 pixels feeding four stacked layers of 200 units each; (b), (c) grids of confabulated digits.]

Figure 4. A deep belief network trained using minimum probability flow learning (MPF). (a) A four layer deep belief network was trained on the MNIST postal hand written digits dataset by MPF and single step contrastive divergence (CD). (b) Confabulations after training via MPF. A reasonable probabilistic model for handwritten digits has been learned. (c) Confabulations after training via CD. The uneven distribution of digit occurrences suggests that CD-1 has learned a less representative model than MPF.

[Figure 5 here: panels (a) and (b) show grids of learned 10 × 10 pixel receptive fields.]

Figure 5. A continuous state space model fit using minimum probability flow learning (MPF). Learned 10 × 10 pixel independent component analysis receptive fields J trained on natural image patches via (a) MPF and (b) maximum likelihood learning (ML). The average log likelihood of the model found by MPF (−120.61 nats) was nearly identical to that found by ML (−120.33 nats), consistent with the visual similarity of the receptive fields.

As a demonstration of parameter estimation in continuous state space probabilistic models, we trained the receptive fields J ∈ R^{K×K} of a K dimensional independent component analysis (ICA) (Bell & Sejnowski, 1995) model with a Laplace prior,

$$p^{(\infty)}(\mathbf{x}; J) = \frac{e^{-\sum_k \left|J_k \mathbf{x}\right|}}{2^K \left|J^{-1}\right|}, \qquad (24)$$

on 100,000 10 × 10 whitened natural image patches from the van Hateren database (Hateren & Schaaf, 1998). Since the log likelihood and its gradient can be calculated analytically for ICA, we solved for J via both maximum likelihood learning and MPF, and compared the resulting log likelihoods. Both training techniques were initialized with identical Gaussian noise, and trained on the same data, which accounts for the similarity of individual receptive fields found by the two algorithms. The average log likelihood of the model after parameter estimation via MPF was −120.61 nats, while the average log likelihood after estimation via maximum likelihood was −120.33 nats. The receptive fields resulting from training under both techniques are shown in Figure 5. MPF minimization was performed by alternating steps of updating the connectivity function g(x_j, x_i) using a Hamiltonian dynamics based scheme, and minimizing the objective function in Equation 15 via L-BFGS for fixed g(x_j, x_i). This is described in more detail in Appendix E.
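Since Equation (24) has a tractable normalizer, the log likelihood comparison above is straightforward to reproduce. The following sketch (hypothetical names) computes the average log likelihood in nats for a filter matrix J on whitened patches X:

```python
import numpy as np

# X: (n_patches, K) whitened image patches; J: (K, K) ICA filter matrix.
def ica_avg_log_likelihood(J, X):
    K = J.shape[0]
    # log p(x) = log|det J| - K log 2 - sum_k |J_k x|, from Equation (24)
    _, logdet = np.linalg.slogdet(J)
    return logdet - K * np.log(2.0) - np.abs(X @ J.T).sum(axis=1).mean()
```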

5. Summary

We have presented a novel, general purpose framework, called minimum probability flow learning (MPF), for parameter estimation in probabilistic models that outperforms current techniques in both learning time and accuracy. MPF works for any parametric model without hidden state variables, including those over both continuous and discrete state space systems, and it avoids explicit calculation of the partition function by employing deterministic dynamics in place of the slow sampling required by many existing approaches. Because MPF provides a simple and well-defined objective function, it can be minimized quickly using existing higher order gradient descent techniques. Furthermore, the objective function is convex for models in the exponential family, ensuring that the global minimum can be found with gradient descent in these cases. MPF was inspired by the minimum velocity approach developed by Movellan, and it reduces to that technique as well as to score matching and some forms of contrastive divergence under suitable choices for the dynamics and state space. We hope that this new approach to parameter estimation will enable probabilistic modeling for previously intractable problems.


Acknowledgments

We would like to thank Javier Movellan, Tamara Broderick, Miroslav Dudík, Gašper Tkačik, Robert E. Schapire, and William Bialek for sharing work in progress and data; Ashvin Vishwanath, Jonathon Shlens, Tony Bell, Charles Cadieu, Nicole Carlson, Christopher Hillar, Kilian Koepsell, Bruno Olshausen and the rest of the Redwood Center for many useful discussions; and the James S. McDonnell Foundation (JSD, PB, JSD) and the Canadian Institute for Advanced Research - Neural Computation and Perception Program (JSD) for financial support.

Appendices

Available at http://redwood.berkeley.edu/jascha/.

References

Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. A learning algorithm for Boltzmann machines. Cognitive Science, 9(2):147–169, 1985.

Bell, A. J. and Sejnowski, T. J. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7:1129–1159, 1995.

Besag, J. Statistical analysis of non-lattice data. The Statistician, 24(3):179–195, 1975.

Brush, S. G. History of the Lenz-Ising model. Reviews of Modern Physics, 39(4):883–893, Oct 1967.

Carreira-Perpiñán, M. A. and Hinton, G. E. On contrastive divergence (CD) learning. Technical report, Dept. of Computer Science, University of Toronto, 2004.

Gutmann, M. and Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. Proc. Int. Conf. on Artificial Intelligence and Statistics (AISTATS 2010), 2010.

Hateren, J. H. van and Schaaf, A. van der. Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings: Biological Sciences, 265(1394):359–366, Mar 1998.

Haykin, S. Neural Networks and Learning Machines, 3rd edition. Prentice Hall, 2008.

Hinton, G. E., Osindero, S., and Teh, Y.-W. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, Jul 2006. doi: 10.1162/neco.2006.18.7.1527.

Hyvärinen, A. Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research, 6:695–709, 2005.

Hyvärinen, A. Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. IEEE Transactions on Neural Networks, Jan 2007.

Jaakkola, T. and Jordan, M. A variational approach to Bayesian logistic regression models and their extensions. Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics, Jan 1997.

Kappen, H. and Rodríguez, F. Mean field approach to learning in Boltzmann machines. Pattern Recognition Letters, Jan 1997.

Lyu, S. Interpretation and generalization of score matching. Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009), 2009.

Lyu, S. Personal communication. 2011.

MacKay, D. Failures of the one-step learning algorithm. Unpublished technical report, Jan 2001.

Macke, J. and Gerwinn, S. Personal communication. 2009.

Movellan, J. R. A minimum velocity approach to learning. Unpublished draft, Jan 2008a.

Movellan, J. R. Contrastive divergence in Gaussian diffusions. Neural Computation, 20(9):2238–2252, 2008b.

Movellan, J. R. and McClelland, J. L. Learning continuous probability distributions with symmetric diffusion networks. Cognitive Science, 17:463–496, 1993.

Pathria, R. Statistical Mechanics. Butterworth Heinemann, Jan 1972.

Schmidt, M. minFunc. http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html, 2005.

Schneidman, E., Berry II, M. J., Segev, R., and Bialek, W. Weak pairwise correlations imply strongly correlated network states in a neural population. Nature, 440(7087):1007–12, 2006.

Shlens, J., Field, G. D., Gauthier, J. L., Grivich, M. I., Petrusca, D., Sher, A., Litke, A. M., and Chichilnisky, E. J. The structure of multi-neuron firing patterns in primate retina. J. Neurosci., 26(32):8254–66, 2006.

Sohl-Dickstein, J. and Olshausen, B. A spatial derivation of score matching. Redwood Center Technical Report, 2009.

Swendsen, R. H. and Wang, J. S. Nonuniversal critical dynamics in Monte Carlo simulations. Physical Review Letters, 58(2):86–88, 1987. ISSN 1079-7114.

Tanaka, T. Mean-field theory of Boltzmann machine learning. Physical Review E, Jan 1998.

Welling, M. and Hinton, G. A new learning algorithm for mean field Boltzmann machines. Lecture Notes in Computer Science, Jan 2002.

Yuille, A. The convergence of contrastive divergences. Department of Statistics Papers, UCLA, 2005.


APPENDICES

A Taylor Expansion of KL Divergence

$$
\begin{aligned}
K(\theta) &\approx D_{KL}\left(p^{(0)} \,\|\, p^{(t)}(\theta)\right)\Big|_{t=0} + \varepsilon\, \frac{\partial D_{KL}\left(p^{(0)} \,\|\, p^{(t)}(\theta)\right)}{\partial t}\Bigg|_{t=0} && \text{(A-1)} \\
&= 0 + \varepsilon\, \frac{\partial D_{KL}\left(p^{(0)} \,\|\, p^{(t)}(\theta)\right)}{\partial t}\Bigg|_{t=0} && \text{(A-2)} \\
&= \varepsilon\, \frac{\partial}{\partial t}\left(\sum_{i \in D} p^{(0)}_i \log \frac{p^{(0)}_i}{p^{(t)}_i}\right)\Bigg|_{0} && \text{(A-3)} \\
&= -\varepsilon \sum_{i \in D} \frac{p^{(0)}_i}{p^{(0)}_i}\, \frac{\partial p^{(t)}_i}{\partial t}\Bigg|_{0} && \text{(A-4)} \\
&= -\varepsilon \sum_{i \in D} \frac{\partial p^{(t)}_i}{\partial t}\Bigg|_{0} && \text{(A-5)} \\
&= -\varepsilon \left(\frac{\partial}{\partial t} \sum_{i \in D} p^{(t)}_i\right)\Bigg|_{0} && \text{(A-6)} \\
&= -\varepsilon\, \frac{\partial}{\partial t}\left(1 - \sum_{i \notin D} p^{(t)}_i\right)\Bigg|_{0} && \text{(A-7)} \\
&= \varepsilon \sum_{i \notin D} \frac{\partial p^{(t)}_i}{\partial t}\Bigg|_{0} && \text{(A-8)} \\
&= \varepsilon \sum_{i \notin D} \sum_{j \in D} \Gamma_{ij}\, p^{(0)}_j && \text{(A-9)} \\
&= \frac{\varepsilon}{|D|} \sum_{i \notin D} \sum_{j \in D} \Gamma_{ij}, && \text{(A-10)}
\end{aligned}
$$

where we used the fact that $\sum_{i \in D} p^{(t)}_i + \sum_{i \notin D} p^{(t)}_i = 1$. This implies that the rate of growth of the KL divergence at time t = 0 equals the total initial flow of probability from states with data into states without.

B Convexity

As observed by Macke and Gerwinn (Macke & Gerwinn, 2009), the MPF objective function is convex for models in the exponential family.


We wish to minimize

$$K = \sum_{i \in D} \sum_{j \in D^c} \Gamma_{ji}\, p^{(0)}_i. \qquad \text{(B-1)}$$

K has derivative

$$\frac{\partial K}{\partial \theta_m} = \sum_{i \in D} \sum_{j \in D^c} \left(\frac{\partial \Gamma_{ji}}{\partial \theta_m}\right) p^{(0)}_i \qquad \text{(B-2)}$$

$$= \frac{1}{2} \sum_{i \in D} \sum_{j \in D^c} \Gamma_{ji} \left(\frac{\partial E_i}{\partial \theta_m} - \frac{\partial E_j}{\partial \theta_m}\right) p^{(0)}_i, \qquad \text{(B-3)}$$

and Hessian

$$
\frac{\partial^2 K}{\partial \theta_m \partial \theta_n} = \frac{1}{4} \sum_{i \in D} \sum_{j \in D^c} \Gamma_{ji} \left(\frac{\partial E_i}{\partial \theta_m} - \frac{\partial E_j}{\partial \theta_m}\right)\left(\frac{\partial E_i}{\partial \theta_n} - \frac{\partial E_j}{\partial \theta_n}\right) p^{(0)}_i + \frac{1}{2} \sum_{i \in D} \sum_{j \in D^c} \Gamma_{ji} \left(\frac{\partial^2 E_i}{\partial \theta_m \partial \theta_n} - \frac{\partial^2 E_j}{\partial \theta_m \partial \theta_n}\right) p^{(0)}_i. \qquad \text{(B-4)}
$$

The first term is a weighted sum of outer products, with non-negative weights $\frac{1}{4}\Gamma_{ji}\, p^{(0)}_i$, and is thus positive semidefinite. The second term is 0 for models in the exponential family (those with energy functions linear in their parameters).

Parameter estimation for models in the exponential family is therefore convex using minimum probability flow learning.

C Score matching

Score matching, developed by Aapo Hyvärinen [Hyvärinen(2005)], is a method that learns parameters in a probabilistic model using only derivatives of the energy function evaluated over the data distribution (see Equation (C-5)). This sidesteps the need to explicitly sample or integrate over the model distribution. In score matching one minimizes the expected square distance of the score function with respect to spatial coordinates given by the data distribution from the similar score function given by the model distribution. A number of connections have been made between score matching and other learning techniques [Hyvärinen(2007a), Sohl-Dickstein & Olshausen(2009), Movellan(2008), Lyu(2009)]. Here we show that in the correct limit, MPF also reduces to score matching.

For a d-dimensional, continuous state space, we can write the MPF objective function as

$$
K_{MPF} = \frac{1}{N} \sum_{x \in D} \int d^d y\; \Gamma(y, x) = \frac{1}{N} \sum_{x \in D} \int d^d y\; g(y, x)\, e^{\frac{1}{2}\left(E(x|\theta) - E(y|\theta)\right)}, \qquad \text{(C-1)}
$$

where the sum $\sum_{x \in D}$ is over all data samples, and N is the number of samples in the data set D. Now we assume that transitions are only allowed from states x to states y that are within a hypercube of side length ε centered around x in state space. (The master equation will reduce to Gaussian diffusion as ε → 0.) Thus, the function g(y, x) will equal 1 when y is within the x-centered cube (or x within the y-centered cube) and 0 otherwise. Calling this cube C_ε, and writing y = x + α with α ∈ C_ε, we have

$$
K_{MPF} = \frac{1}{N} \sum_{x \in D} \int_{C_\varepsilon} d^d\alpha\; e^{\frac{1}{2}\left(E(x|\theta) - E(x+\alpha|\theta)\right)}. \qquad \text{(C-2)}
$$

If we Taylor expand in α to second order and ignore cubic and higher terms, we get

$$
\begin{aligned}
K_{MPF} \approx\; & \frac{1}{N} \sum_{x \in D} \int_{C_\varepsilon} d^d\alpha\; (1) \\
& - \frac{1}{N} \sum_{x \in D} \int_{C_\varepsilon} d^d\alpha\; \frac{1}{2} \sum_{i=1}^d \alpha_i \nabla_{x_i} E(x|\theta) \\
& + \frac{1}{N} \sum_{x \in D} \int_{C_\varepsilon} d^d\alpha\; \frac{1}{4}\left(\frac{1}{2}\left[\sum_{i=1}^d \alpha_i \nabla_{x_i} E(x|\theta)\right]^2 - \sum_{i,j=1}^d \alpha_i \alpha_j \nabla_{x_i} \nabla_{x_j} E(x|\theta)\right). \qquad \text{(C-3)}
\end{aligned}
$$

This reduces to

$$
K_{MPF} \approx \frac{1}{N} \sum_{x \in D}\left[\varepsilon^d + \frac{1}{4}\left(\frac{1}{2}\,\frac{2}{3}\,\varepsilon^{d+2} \sum_{i=1}^d \left[\nabla_{x_i} E(x|\theta)\right]^2 - \frac{2}{3}\,\varepsilon^{d+2} \sum_{i=1}^d \nabla^2_{x_i} E(x|\theta)\right)\right], \qquad \text{(C-4)}
$$

which, removing a constant offset and scaling factor, is exactly equal to the score matching objective function,

$$
K_{MPF} \sim \frac{1}{N} \sum_{x \in D}\left[\frac{1}{2} \nabla E(x|\theta) \cdot \nabla E(x|\theta) - \nabla^2 E(x|\theta)\right] \qquad \text{(C-5)}
$$

$$
= K_{SM}. \qquad \text{(C-6)}
$$

Score matching is thus equivalent to MPF when the connectivity function g(y, x) is non-zero only for states infinitesimally close to each other. It should be noted that the score matching estimator has a closed-form solution when the model distribution belongs to the exponential family [Hyvärinen(2007b)], so the same can be said for MPF in this limit.


D Sampling the connectivity function Γij

Here we extend MPF to allow the connectivity function Γ_ij to be sampled rather than set via a deterministic scheme. Since Γ is now sampled, we modify detailed balance to demand that, averaging over the choices for Γ, the net flow between pairs of states is 0,

$$\left\langle \Gamma_{ji}\, p^{(\infty)}_i(\theta) \right\rangle = \left\langle \Gamma_{ij}\, p^{(\infty)}_j(\theta) \right\rangle \qquad \text{(D-1)}$$

$$\left\langle \Gamma_{ji} \right\rangle p^{(\infty)}_i(\theta) = \left\langle \Gamma_{ij} \right\rangle p^{(\infty)}_j(\theta), \qquad \text{(D-2)}$$

where the ensemble average is over the connectivity scheme for Γ. We describe the connectivity scheme via a proposal distribution g_ij, such that the probability of there being a connection from state j to state i at any given moment is g_ij. We also introduce a function F_ij, which provides the value Γ_ij takes on when a connection occurs from j to i. That is, it is the probability flow rate when flow occurs,

$$\left\langle \Gamma_{ij} \right\rangle = g_{ij} F_{ij}. \qquad \text{(D-3)}$$

Detailed balance now becomes

$$g_{ji} F_{ji}\, p^{(\infty)}_i(\theta) = g_{ij} F_{ij}\, p^{(\infty)}_j(\theta). \qquad \text{(D-4)}$$

Solving for F we find

$$\frac{F_{ij}}{F_{ji}} = \frac{g_{ji}}{g_{ij}}\, \frac{p^{(\infty)}_i(\theta)}{p^{(\infty)}_j(\theta)} = \frac{g_{ji}}{g_{ij}}\, \exp\left[E_j(\theta) - E_i(\theta)\right]. \qquad \text{(D-5)}$$

F is underconstrained by the above equation. Motivated by symmetry, we choose as the form for the (non-zero, non-diagonal) entries in F

$$F_{ij} = \left(\frac{g_{ji}}{g_{ij}}\right)^{\frac{1}{2}} \exp\left[\frac{1}{2}\left(E_j(\theta) - E_i(\theta)\right)\right]. \qquad \text{(D-6)}$$

Γ is now populated as

$$r_{ij} \sim \mathrm{rand}\,[0, 1) \qquad \text{(D-7)}$$

$$\Gamma_{ij} = \begin{cases} -\sum_{k \neq i} \Gamma_{ki} & i = j \\ F_{ij} & r_{ij} < g_{ij} \text{ and } i \neq j \\ 0 & r_{ij} \geq g_{ij} \text{ and } i \neq j. \end{cases} \qquad \text{(D-8)}$$

Similarly, its average value can be written as

$$\left\langle \Gamma_{ij} \right\rangle = g_{ij} \left(\frac{g_{ji}}{g_{ij}}\right)^{\frac{1}{2}} \exp\left[\frac{1}{2}\left(E_j(\theta) - E_i(\theta)\right)\right] \qquad \text{(D-9)}$$

$$= \left(g_{ij}\, g_{ji}\right)^{\frac{1}{2}} \exp\left[\frac{1}{2}\left(E_j(\theta) - E_i(\theta)\right)\right]. \qquad \text{(D-10)}$$

So, we can use any connectivity scheme g in learning. We just need to scale the non-zero, non-diagonal entries in Γ by $\left(\frac{g_{ji}}{g_{ij}}\right)^{\frac{1}{2}}$ so as to compensate for the biases introduced by the connectivity scheme.

The full MPF objective function in this case is

$$K = \sum_{j \in D} \sum_{i \notin D} g_{ij} \left(\frac{g_{ji}}{g_{ij}}\right)^{\frac{1}{2}} \exp\left[\frac{1}{2}\left(E_j - E_i\right)\right] \qquad \text{(D-11)}$$

where the inner sum is found by averaging over samples from g_ij.

E Continuous state space learning with the connectivity function set via Hamiltonian Monte Carlo

Choosing the connectivity matrix g_ij for Minimum Probability Flow Learning is relatively straightforward in systems with binary or discrete state spaces. Nearly any nearest neighbor style scheme seems to work quite well. In continuous state spaces q ∈ R^d, however, connectivity functions g(q_i, q_j) based on nearest neighbors prove insufficient. For instance, if the non-zero entries in g(q_i, q_j) are drawn from an isotropic Gaussian centered on q_j, then several hundred non-zero g(q_i, q_j) are required for every value of q_j in order to achieve effective parameter estimation in some fairly standard problems, such as receptive field estimation in Independent Component Analysis [Bell AJ(1995)].

Qualitatively, we desire to connect every data state q_j ∈ D to the non-data states q_i which will be most informative for learning. The most informative states are those which have high probability under the model distribution p^(∞)(q). We therefore propose to populate g(q_i, q_j) using a Markov transition function for the model distribution. Borrowing techniques from Hamiltonian Monte Carlo [Neal(2010)] we use Hamiltonian dynamics in our transition function, so as to effectively explore the state space.

E.1 Extending the state space

In order to implement Hamiltonian dynamics, we first extend the state space to include auxiliary momentum variables.

The initial data and model distributions are p^(0)(q) and

$$p^{(\infty)}(q; \theta) = \frac{\exp\left(-E(q; \theta)\right)}{Z(\theta)}, \qquad \text{(E-1)}$$

with state space q ∈ R^d. We introduce auxiliary momentum variables v ∈ R^d for each state variable q, and call the extended state space including the momentum variables x = {q, v}. The momentum variables are given an isotropic Gaussian distribution,

$$p(v) = \frac{\exp\left(-\frac{1}{2} v^T v\right)}{\sqrt{2\pi}}, \qquad \text{(E-2)}$$

and the extended data and model distributions become

$$p^{(0)}(x) = p^{(0)}(q)\, p(v) = p^{(0)}(q)\, \frac{\exp\left(-\frac{1}{2} v^T v\right)}{\sqrt{2\pi}}, \qquad \text{(E-3, E-4)}$$

$$p^{(\infty)}(x; \theta) = p^{(\infty)}(q; \theta)\, p(v) = \frac{\exp\left(-E(q; \theta)\right)}{Z(\theta)}\, \frac{\exp\left(-\frac{1}{2} v^T v\right)}{\sqrt{2\pi}} = \frac{\exp\left(-H(x; \theta)\right)}{Z(\theta)\sqrt{2\pi}}, \qquad \text{(E-5, E-6, E-7)}$$

$$H(x; \theta) = E(q; \theta) + \frac{1}{2} v^T v. \qquad \text{(E-8)}$$

The initial (data) distribution over the joint space x can be realized by drawing a momentum v from the Gaussian distribution p(v) for every observation q in the dataset D.

E.2 Defining the connectivity function g (xi,xj)

We connect every state x_j to all states which satisfy one of the following two criteria:

1. All states which share the same position q_j, with a quadratic falloff in g(x_i, x_j) with the momentum difference v_i − v_j.

2. The state which is reached by simulating Hamiltonian dynamics for a fixed time t on the system described by H(x; θ_H), and then negating the momentum. Note that the parameter vector θ_H is used only for the Hamiltonian dynamics.

More formally,

$$g(x_i, x_j) = \delta\left(q_i - q_j\right) \exp\left(-\frac{\|v_i - v_j\|^2}{2}\right) + \delta\left(x_i - \mathrm{HAM}\left(x_j; \theta_H\right)\right), \qquad \text{(E-9)}$$

where if x′ = HAM(x; θ_H), then x′ is the state that results from integrating Hamiltonian dynamics for a time t and then negating the momentum. Because of the momentum negation, x = HAM(x′; θ_H), and g(x_i, x_j) = g(x_j, x_i).


E.3 Discretizing Hamiltonian dynamics

It is generally impossible to exactly simulate the Hamiltonian dynamics for the system described by H(x; θ_H). However, if HAM(x; θ_H) is set to simulate Hamiltonian dynamics via a series of leapfrog steps, it retains the important properties of reversibility and phase space volume conservation, and can be used in the connectivity function g(x_i, x_j) in Equation E-9. In practice, therefore, HAM(x; θ_H) is implemented as a series of leapfrog steps.
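An illustrative leapfrog implementation of HAM(x; θ_H) is sketched below (names and step sizes are ours, not the authors' code); grad_E is assumed to compute the gradient of the energy E(q; θ_H):

```python
import numpy as np

def ham(q, v, grad_E, eps=0.1, n_steps=10):
    # Standard leapfrog integration of Hamiltonian dynamics, followed by
    # momentum negation so the map is its own inverse: ham(ham(x)) = x.
    v = v - 0.5 * eps * grad_E(q)       # initial half step in momentum
    for _ in range(n_steps - 1):
        q = q + eps * v                 # full step in position
        v = v - eps * grad_E(q)         # full step in momentum
    q = q + eps * v                     # final full step in position
    v = v - 0.5 * eps * grad_E(q)       # final half step in momentum
    return q, -v                        # negate momentum
```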

E.4 MPF objective function

The MPF objective function for continuous state spaces and a list of observations D is

$$K\left(\theta; D, \theta_H\right) = \sum_{x_j \in D} \int g\left(x_i, x_j\right) \exp\left(\frac{1}{2}\left[H\left(x_j; \theta\right) - H\left(x_i; \theta\right)\right]\right) dx_i. \qquad \text{(E-10)}$$

For the connectivity function g(x_i, x_j) given in Section E.2, this reduces to

$$
K\left(\theta; D, \theta_H\right) = \sum_{x_j \in D} \int \exp\left(-\frac{\|v_i - v_j\|^2}{2}\right) \exp\left(\frac{1}{2}\left[\frac{1}{2} v_j^T v_j - \frac{1}{2} v_i^T v_i\right]\right) dv_i + \sum_{x_j \in D} \exp\left(\frac{1}{2}\left[H\left(x_j; \theta\right) - H\left(\mathrm{HAM}\left(x_j; \theta_H\right); \theta\right)\right]\right). \qquad \text{(E-11)}
$$

Note that the first term does not depend on the parameters θ, and is thus just a constant offset which can be ignored during optimization. Therefore, we can say

$$K\left(\theta; D, \theta_H\right) \sim \sum_{x_j \in D} \exp\left(\frac{1}{2}\left[H\left(x_j; \theta\right) - H\left(\mathrm{HAM}\left(x_j; \theta_H\right); \theta\right)\right]\right). \qquad \text{(E-12)}$$

Parameter estimation is performed by finding the parameter vector θ which minimizes the objective function K(θ; D, θ_H),

$$\hat{\theta} = \operatorname*{argmin}_{\theta} K\left(\theta; D, \theta_H\right). \qquad \text{(E-13)}$$


E.5 Iteratively improving the objective function

The more similar θ_H is to θ, the more informative g(x_i, x_j) is for learning. If θ_H and θ are dissimilar, then many more data samples will be required in D to effectively learn. Therefore, we iterate the following procedure, which alternates between finding the θ which minimizes K(θ; D, θ_H) and improving θ_H by setting it to θ:

1. Set θ^(t+1) = argmin_θ K(θ; D, θ_H^(t))

2. Set θ_H^(t+1) = θ^(t+1)

θ^(t) then represents a steadily improving estimate for the parameter values which best fit the model distribution p^(∞)(q; θ) to the data distribution p^(0)(q), described by observations D. Practically, step 1 above will frequently be truncated early, perhaps after 10 or 100 L-BFGS gradient descent steps.

References

[Bell AJ(1995)] Bell, A. J. and Sejnowski, T. J. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7:1129–1159, 1995.

[Hyvärinen(2005)] Hyvärinen, A. Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research, 6:695–709, 2005.

[Hyvärinen(2007a)] Hyvärinen, A. Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. IEEE Transactions on Neural Networks, Jan 2007a.

[Hyvärinen(2007b)] Hyvärinen, A. Some extensions of score matching. Computational Statistics & Data Analysis, 51(5):2499–2512, 2007b. ISSN 0167-9473.

[Lyu(2009)] Lyu, S. Interpretation and generalization of score matching. Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009), 2009.

[Movellan(2008)] Movellan, J. R. A minimum velocity approach to learning. Unpublished draft, Jan 2008.

[Neal(2010)] Neal, R. M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, Jan 2010. See sections 5.2 and 5.3 for Langevin dynamics.

[Sohl-Dickstein & Olshausen(2009)] Sohl-Dickstein, J. and Olshausen, B. A spatial derivation of score matching. Redwood Center Technical Report, 2009.
