
Embedded probabilistic programming in Clojure

Nils Bertschinger
Max-Planck Institute for Mathematics in the Sciences
Leipzig, Germany
[email protected]

ABSTRACT
Probabilistic programming is a powerful tool for specifying probabilistic models directly in terms of a computer program. Here, I present a library that allows probabilistic computations to be embedded into Clojure. Automatic tracking of dependencies between probabilistic choice points enables an efficient way to sample from the distribution corresponding to a probabilistic program.

1. INTRODUCTION
Nowadays, many problems of artificial intelligence are formulated as probabilistic inference, and the machine learning community makes heavy use of probabilistic models [2]. Thanks to the steady increase of computing power and algorithmic advances, it is now feasible to apply such models to real-world data from various domains. Many different types of inference algorithms are available, but most of them fall into just a few well-studied and well-understood classes, such as Gibbs sampling, Variational Bayes, etc. Even when such standard algorithms are used, a lot of hand-crafting is still required to turn a probabilistic model into code.

Probabilistic programming tries to remedy this problem by providing specialized programming languages for expressing probabilistic models. The inference engine is hidden from the user and can handle a large class of models in a generic way. WinBUGS [6] is a popular example of a language for specifying Bayesian networks, which are then solved via Gibbs sampling. Instead of designing a custom domain-specific language (DSL) for probabilistic computations, it is also possible to extend existing languages with probabilistic semantics. This has been done mainly for logic and functional programming languages [3, 8]. Especially in the case of functional programming languages, rather lightweight embeddings have been developed [5, 13]. Furthermore, the structure of the embedding is well understood in terms of the probability monad [11].

Here, I present a shallow embedding of probabilistic programming into Clojure, a modern Lisp dialect running on the JVM. The implementation is based on the bher compiler [13], but does not require a transformational compilation step. Instead, it utilizes the dynamic features of the host language for a seamless and efficient embedding. Like the bher compiler, my library uses the general Metropolis-Hastings algorithm (Sec. 2.1) to draw samples from a probabilistic program. In contrast to previous implementations, dependencies between the probabilistic choice points encountered during a run of the program are tracked (Sec. 3). This makes it possible to speed up sampling by exploiting conditional independence relations. Sec. 4 illustrates the resulting efficiency on an example application. Finally, Sec. 5 concludes with some general comments on the development of this library.

2. PROBABILISTIC PROGRAMMING
A functional program consists of a sequence of functions which are applied to some input values in order to compute a desired output. Each function in the program can be understood as a mathematical function, i.e., it produces the same output when given the same input values. A function with this property, that the output can only depend on its inputs, is called "pure". The semantics of a functional program is thus given by a mapping from inputs to outputs. Probabilistic programs extend the notion of a function to probabilistic functions, which map inputs to a distribution over their outputs. A probabilistic program thus specifies a (conditional) probability distribution instead of a deterministic function.
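As a tiny illustration of this distinction (the two functions below are purely illustrative and not part of the library):

    ;; A pure function: the same input always yields the same output.
    (defn square [x] (* x x))

    ;; A probabilistic function: the output is a draw from a distribution
    ;; over values that depends on the input.
    (defn noisy-square [x]
      (+ (* x x) (- (rand) 0.5)))

    (= (square 3) (square 3))              ;; => true, always
    (= (noisy-square 3) (noisy-square 3))  ;; => almost surely false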

Probabilistic functions are not pure¹, in the sense that they can return different values, drawn from a fixed distribution, for the same input. Thus, a run of a probabilistic program gives rise to a computation tree where each node represents a basic random choice, such as flipping a coin. The children correspond to the possible outcomes of the choice, and the edges of the tree are weighted with the probability of the corresponding choice (see Fig. 1 for a simple example). Each path through the computation tree corresponds to a possible realization of the probabilistic program and will be called a trace in the following. The probability of such a trace is simply the product of all edge weights along the path.
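For concreteness, if a trace is represented as the sequence of choices taken along one path, its probability is just the product of the recorded edge weights. A minimal sketch (the trace representation here is an assumption for illustration, not the library's internal format):

    ;; A trace as a vector of maps, one per random choice along the path.
    (defn trace-probability [trace]
      (reduce * 1.0 (map :prob trace)))

    (trace-probability [{:id :c1 :value true :prob 0.5}
                        {:id :c2 :value true :prob 0.8}])
    ;; => 0.4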

Similar trees also arise from non-deterministic operations, which have often been embedded into functional languages [1]. The main difference is that probabilistic choices are weighted and the probabilities of different choices add up in the final result. Therefore, the common practice for implementing non-determinism, i.e., following just one possible trace and backtracking if no solution could be found, is not suitable for evaluating a probabilistic program. Instead, either all possible traces have to be considered, or an approximate scheme to sample from the probability distribution corresponding to the program is necessary.

¹In particular, the semantics of a probabilistic program changes when memoization is introduced.

2.1 Metropolis-Hastings sampling
Metropolis-Hastings is a general method to draw samples from a probability distribution. Since it only needs to evaluate ratios of probabilities, it is particularly useful if the probabilities can only be calculated up to normalization. Metropolis-Hastings is a Markov-Chain Monte-Carlo (MCMC) method which, instead of directly sampling from the desired distribution p(x), constructs a Markov chain with transition probability p(x'|x) whose unique invariant distribution is p(x) (see [2] for details). Assume that the current sample is x and p(x) can be evaluated up to normalization. Then, a proposal distribution q(x'|x) is used to generate a new candidate sample x', which is accepted as the new sample with probability

$$\min\left\{\frac{p(x')\,q(x \mid x')}{p(x)\,q(x' \mid x)},\; 1\right\}.$$

Otherwise, the old sample x is retained. It is well known that this defines a Markov chain with invariant distribution p(x). The efficiency of this algorithm depends crucially on the proposal distribution: ideally, the proposed values should come from a distribution close to the desired one that is, at the same time, easy to sample from.
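To make the acceptance rule concrete, the following is a minimal, generic Metropolis-Hastings step in Clojure. It is a sketch under assumed conventions (log densities, a proposal returning forward and backward log probabilities), not the library's implementation:

    ;; log-p: unnormalized log density.
    ;; propose: returns [x' log-q(x'|x) log-q(x|x')].
    (defn metropolis-hastings-step [log-p propose x]
      (let [[x' log-fwd log-bwd] (propose x)
            log-accept (+ (- (log-p x') (log-p x))
                          (- log-bwd log-fwd))]
        (if (< (Math/log (rand)) log-accept)
          x'      ; accept the candidate
          x)))    ; otherwise retain the old sample

    ;; Example: a standard normal target with a symmetric random-walk
    ;; proposal, for which the forward and backward terms cancel.
    (defn random-walk-propose [x]
      [(+ x (- (rand) 0.5)) 0.0 0.0])

    (take 5 (iterate (partial metropolis-hastings-step
                              (fn [x] (* -0.5 x x))
                              random-walk-propose)
                     0.0))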

In the case of a probabilistic program, each sample corresponds to a trace through the computation tree. A global store of all probabilistic choices is used to keep track of the current trace. Thus, the probabilistic program is purified and deterministically evaluates to the particular trace through the computation tree that is recorded in the global store. From this, a new proposed trace is constructed using the steps illustrated in Fig. 2:

1. Select a random choice c (with value v) from the old trace.

2. Propose a new value v' for this random choice using a local proposal distribution q_c(v'|v).

3. Using the global store of all random choices, obtain a new trace by re-running the probabilistic program with a new store reflecting the proposed value.

Pseudo-code for this algorithm can be found in [13]. Unfortunately, the authors do not fully specify how the so-called forward and backward probabilities q(x'|x) and q(x|x') are obtained. Here, care needs to be taken that choice points which occur only in the new (old) trace but not in the other are accounted for in the forward (backward) probability.

An important point is how choice points are identified between the different traces. If they were considered unrelated, each new trace would draw completely new choice points and the method would boil down to rejection sampling. The change becomes more local the more choice points can be identified between the two traces. In [13] it is argued that choice points should be identified according to their structural position in the program. Currently, this is not yet implemented, and instead the user has to explicitly tag each choice point. With dynamically bound variables it should be possible to pass along information about the structural position of each choice point; implementing a naming scheme based on this idea is left for future work. A sketch of the idea is shown below.
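The following sketch shows how dynamically bound variables could carry such positional information (the names at and *addr* are hypothetical; this is not implemented in the library):

    ;; The current structural position, extended on the way down.
    (def ^:dynamic *addr* [])

    (defmacro at [tag & body]
      `(binding [*addr* (conj *addr* ~tag)]
         ~@body))

    (at :outer
      (at :inner
        *addr*))
    ;; => [:outer :inner], a stable tag for a choice point at this position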

2.2 Memoization and conditioning
Memoization of choice points, as introduced in [3], can be implemented by simply changing the tag of a memoized choice point to reflect its type and the arguments on which it should be memoized. This way, the same random choice is fetched from the global store if the memoized choice point is called again with the same arguments. The form (memo <tag> <choice-point> <args>) is provided to memoize a basic choice point on its parameters and possibly further identifying arguments.
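A minimal sketch of this tagging mechanism (the store layout and the names sample-tagged and memo-flip are assumptions for illustration, not the library's API):

    (def choice-store (atom {}))

    ;; Fetch a stored choice by tag, sampling and storing it on first use.
    (defn sample-tagged [tag sample-fn]
      (if (contains? @choice-store tag)
        (get @choice-store tag)
        (let [v (sample-fn)]
          (swap! choice-store assoc tag v)
          v)))

    ;; Memoized coin flip: the same tag and arguments hit the same entry.
    (defn memo-flip [tag p & args]
      (sample-tagged [tag args] #(< (rand) p)))

    (= (memo-flip :coin 0.5 :alice)
       (memo-flip :coin 0.5 :alice))
    ;; => true, since the second call reuses the stored random choice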

Conditioning can be implemented in different ways. The most general form allows conditioning on any predicate and can be achieved by invalidating the trace, i.e., setting its probability to zero, if the predicate evaluates to false. Unfortunately, this is a rather inefficient way to implement conditioning, since an invalid trace is always rejected and thus a form of rejection sampling is obtained. In the common case of conditioning a basic random choice on a single, specified value, a better implementation is possible: the choice point is fixed at the conditioning value and deterministically reweights the trace by the probability of this value.
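The two strategies can be sketched as follows (the trace representation and function names are again assumptions, for illustration only):

    ;; General but inefficient: conditioning on a predicate invalidates
    ;; the trace, i.e. sets its (log) probability to zero, when violated.
    (defn condition-predicate [trace pred?]
      (if (pred? trace)
        trace
        (assoc trace :log-prob Double/NEGATIVE_INFINITY)))

    ;; Efficient special case: fix a basic choice at the observed value
    ;; and deterministically reweight the trace by its probability.
    (defn condition-value [trace choice-tag observed log-prob-fn]
      (-> trace
          (assoc-in [:choices choice-tag] observed)
          (update :log-prob + (log-prob-fn observed))))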

3. NETWORKS OF CHOICE POINTS
Whenever a new proposal is evaluated, the whole program is run again in order to compute the updated trace. In contrast, a hand-crafted sampling algorithm, e.g. for a graphical model, only recomputes those choice points which are neighbors of the node where a change is proposed. Especially in the common case of sparsely connected models with many conditional independence relations, this leads to huge performance gains. Quite often, each sampling step can be computed in constant time, independent of the total number of choice points.

Similar observations have led some researchers away from probabilistic functional programming. Instead, they have developed object-oriented frameworks [10, 7] which represent each choice point as an object. The program is then used to connect such objects and imperatively defines a model consisting of interconnected choice points. This model can then be solved either by exact methods based on belief propagation or by approximate methods such as Markov-Chain Monte Carlo. This approach has the further advantage that subclassing allows the user to provide custom choice points and to control the proposal distributions that are to be used. Unfortunately, the direct relation between the program and the model is lost, and it is usually not possible to specify models with a changing topology (e.g. Fig. 1). Thus, the object-oriented implementation is more of a library for Bayesian networks (or Markov random fields) than a language extension for embedded probabilistic programming².

Here, I propose an implementation which keeps track of the dependencies between probabilistic choices. Then only the actual dependents of a choice point need to be recomputed if a new value is proposed. This approach is inspired by reactive programming, which allows variables to be set via spreadsheet-like formulas; if a variable changes, all dependent formulas are updated automatically.

²A better embedding can be achieved if language constructs are overloaded for choice point objects. This works rather nicely for standard operators on numbers, lists, etc., but looks somewhat clumsy for special forms, e.g. if. Furthermore, a small additional overhead is introduced for every operation whereas, ideally, the deterministic parts of the program should be able to run at full speed.

Figure 1: A simple probabilistic program and its computation tree. A: The probabilistic program in Clojuresque pseudo-code (flip is a basic probabilistic function which returns true with the specified probability and false otherwise):

    (let [c1 (flip 0.5)
          c2 (flip (if c1 0.8 0.4))
          c3 (if (= c1 c2)
               (flip 0.3)
               true)]
      (and c1 c3))

B: The computation tree corresponding to the program (the tree itself is not reproduced here). Note that the number of random choices depends on the outcome of previous choices.

Figure 2: Example of proposing a new trace by changing the value of the first random choice c1. The value of choice c2 can be reused, but has to be reweighted with the new probability. The random choice at c3 is created anew when moving from the old to the new trace. (The old and new computation trees shown in the figure are not reproduced here.)

This approach has, for example, been implemented in Common Lisp (Cells [12]) and Python (Trellis) in order to simplify the specification of user interfaces. In both implementations, dependencies between variables are tracked automatically and change dynamically when different conditional branches are followed.
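The core of such dependency tracking fits in a few lines. In the sketch below (a simplification for illustration, with gv* named so as not to suggest it is the library's gv), reading a value while a formula is being computed registers that formula as a dependent:

    (def ^:dynamic *current* nil)   ; the choice point being recomputed

    (defn make-cp [value]
      (atom {:value value :dependents #{}}))

    ;; Accessing a value registers the current formula as a dependent.
    (defn gv* [cp]
      (when *current*
        (swap! cp update :dependents conj *current*))
      (:value @cp))

    ;; Re-running a formula re-establishes its dependencies dynamically.
    (defn recompute! [cp formula]
      (binding [*current* cp]
        (swap! cp assoc :value (formula))))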

To make this tracking possible, choice points are specified explicitly, and a dependency on another choice point is established whenever its value is accessed with (gv <choice-point>). Using this approach, the example from Fig. 1 can be written as follows:

    (let [c1 (flip-cp :c1 0.5)
          c2 (flip-cp :c2
                      (if (gv c1) 0.8 0.4))
          c3 (det-cp :c3
                     (if (= (gv c1) (gv c2))
                       (gv (flip-cp :c3-a 0.3))
                       true))]
      (det-cp :result
              (and (gv c1) (gv c3))))

Thus, the programmer can control which parts of the program are recomputed in case of a proposal. For this to work, the value of a choice point is not extracted in the binding form, but explicitly accessed with gv. In order to ensure that all parts of the program which use values of probabilistic choices can be recomputed, some additional deterministic choice points have to be introduced which re-evaluate their body whenever any choice point they depend on changes. Overall, the program can be written in a style that is quite close to a full DSL for probabilistic programming. The programmer just has to take care that choice points are always accessed with gv and only inside other choice points³.

³Clojure is dynamically typed, and therefore the compiler cannot enforce that these rules are obeyed. Nevertheless, the restrictions on probabilistic programs are bearable and unproblematic in practical programs.

Overall, the resulting programming style is somewhere intermediate between a functional probabilistic programming language and the object-oriented style mentioned above. In the latter approach, objects are used to represent choice points, and dependencies are explicitly constructed between them via methods that take parent choice points as arguments. Thus, the program is not a specification of the probabilistic model itself, but is used to construct it as a static network of choice points. Here, choice points are also represented as a special data structure, but as in functional probabilistic programming, the program itself is the probabilistic model. The explicit choice points are merely introduced in order to speed up the recomputations necessary during Metropolis-Hastings sampling. In addition, having an explicit representation of choice points enables extensibility, as in the object-oriented approach. The macro def-prob-cp allows adding custom choice points which support additional probability distributions or use specialized proposal distributions.
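The paper does not show the syntax of def-prob-cp, but the ingredients a custom choice point has to supply can be illustrated with a plain map (a hypothetical layout, for illustration only):

    ;; A hypothetical exponential choice point: a sampler, a log density,
    ;; and a specialized (multiplicative random-walk) proposal.
    (def exponential-cp
      {:sample  (fn [rate] (/ (- (Math/log (- 1.0 (rand)))) rate))
       :log-pdf (fn [x rate] (- (Math/log rate) (* rate x)))
       :propose (fn [x] (* x (Math/exp (- (rand) 0.5))))})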

4. EXAMPLE
To illustrate the power of probabilistic programming, Fig. 3 shows how a Gaussian mixture model with a Dirichlet prior for the mixture components can be implemented. The program closely follows the structure of the statistical model:

    1. Draw the component weights from a Dirichlet prior.

    2. Assign each data point to one of the components.

3. Draw a mean for the corresponding component model from a Gaussian prior.

4. Condition the Gaussian component model on the observed data point.

Note how memoization is used to share parameters of the component models between different data points assigned to the same mixture component (see Fig. 3).

Figure 3: A: Probabilistic Clojure program implementing a Gaussian mixture model:

    (defn mixture-memo [comp-labels data]
      (let [comp-weights (dirichlet-cp :weights (for [_ comp-labels] 10.0))]
        (doseq [[idx point] (indexed data)]
          (let [comp    (discrete-cp [:comp idx]
                                     (zipmap comp-labels (gv comp-weights)))
                comp-mu (memo [:mu idx] (gaussian-cp :mu 0.0 10.0) (gv comp))]
            (cond-data (gaussian-cp [:obs idx] (gv comp-mu) 1.0) point)))
        (det-cp :mixture
                [(into {} (for [comp comp-labels]
                            [comp (gv (memo [:mu comp]
                                            (gaussian-cp :mu 0.0 10.0)
                                            comp))]))
                 (gv comp-weights)])))

B: Sample from the posterior overlaid on a histogram of the data set (the plot itself is not reproduced here).

Furthermore, the example illustrates that conditioning is a side effect and acts like an assignment.

The following call to a library function

    (metropolis-hastings-sampling
     (fn [] (mixture-memo [:a :b :c] data)))

returns a sequence of samples from the posterior distribution of the model with three mixture components, labeled :a, :b and :c, conditioned on a vector of observed data points.

Panel B of Fig. 3 shows a single sample from the posterior overlaid on a histogram of the data points. Here, a data set consisting of 750 points drawn from a mixture of three Gaussians has been used. Since each Metropolis-Hastings step only needs to recompute the choice points which are affected by the proposal, the program scales to considerably larger data sets. Preliminary tests show that it is slightly faster than the Python library PyMC [9], which also implements MCMC sampling.

5. CONCLUSIONS
Adjusting the language towards different semantics, such as non-determinism or logic programming, is a common way to design Lisp programs (see for example [4]). The presented Clojure library demonstrates that probabilistic programming can just as easily be integrated into Lisp. Indeed, the code required for running the example from Sec. 4 is surprisingly compact, at just below 1000 LoC. The implementation utilizes dynamically bound variables and syntactic extensions via macros⁴ to achieve a seamless embedding of the probabilistic programming model.

In addition, Clojure's immutable data structures allow for an easy and efficient implementation of Metropolis-Hastings sampling: reverting to the old trace is always possible without first creating a copy of the global store. Thus, they serve as efficient difference data structures, which had to be hand-crafted in [8] for this purpose. A small illustration is given below.
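This point can be demonstrated with plain Clojure, independent of the library:

    (def old-store {:c1 true :c2 false})
    (def new-store (assoc old-store :c2 true))   ; structural sharing, no copy

    old-store
    ;; => {:c1 true, :c2 false}
    ;; The old store is untouched by the update, so rejecting a proposal
    ;; simply means keeping this reference.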

During development, the dynamic nature of Clojure was of great help. I implemented different prototypes, which helped to clarify unforeseen corner cases that made the tracking of dependencies somewhat tricky. Here too, the immutable data structures prevented bugs due to state updates and thus simplified the task of keeping the dependencies consistent.

Overall, probabilistic programming is a powerful tool for specifying probabilistic models. Many different types of models, e.g. mixture models, regression models, etc., can be implemented concisely and with ease. Thus, even though the performance falls short of hand-crafted algorithms, it is rather useful for rapid prototyping, and the presented library now allows experimenting with different types of probabilistic models directly in Lisp.

⁴Since Clojure shares these features with other languages of the Lisp family, porting the library to Scheme or Common Lisp is quite possible.

6. REFERENCES
[1] H. Abelson and G. J. Sussman. Structure and Interpretation of Computer Programs. MIT Press, 2nd edition, 1996.

[2] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[3] N. D. Goodman, V. K. Mansinghka, D. M. Roy, K. Bonawitz, and J. B. Tenenbaum. Church: A language for generative models. In Uncertainty in Artificial Intelligence, pages 220-229, 2008.

[4] P. Graham. On Lisp: Advanced Techniques for Common Lisp. Prentice Hall, 1993.

[5] O. Kiselyov and C. Shan. Embedded probabilistic programming. In W. Taha, editor, IFIP Working Conference on Domain-Specific Languages, LNCS 5658, pages 360-384. Springer, 2009.

[6] D. J. Lunn, A. Thomas, N. Best, and D. Spiegelhalter. WinBUGS, a Bayesian modelling framework: concepts, structure, and extensibility. Statistics and Computing, 10:325-337, 2000.

[7] A. McCallum, K. Schultz, and S. Singh. FACTORIE: Probabilistic programming via imperatively defined factor graphs. In Advances in Neural Information Processing Systems (NIPS), 2009.

[8] B. Milch and S. Russell. General-purpose MCMC inference over relational structures. In 22nd Conference on Uncertainty in Artificial Intelligence, pages 349-358, 2006.

[9] A. Patil, D. Huard, and C. J. Fonnesbeck. PyMC: Bayesian stochastic modelling in Python. Journal of Statistical Software, 35(4):1-81, 2010.

[10] A. Pfeffer. Figaro: An object-oriented probabilistic programming language. Technical report, Charles River Analytics, 2009.

[11] N. Ramsey and A. Pfeffer. Stochastic lambda calculus and monads of probability distributions. In Proceedings of the 29th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '02, pages 154-165, New York, NY, USA, 2002. ACM.

[12] K. Tilton. The Cells manifesto. http://smuglispweeny.blogspot.com/2008/02/cells-manifesto.html.

[13] D. Wingate, A. Stuhlmueller, and N. D. Goodman. Lightweight implementations of probabilistic programming languages via transformational compilation. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011.