



Phys. Biol. 12 (2015) 045005 doi:10.1088/1478-3975/12/4/045005

PAPER

Model reduction for stochastic CaMKII reaction kinetics in synapses by graph-constrained correlation dynamics

Todd Johnson 1, Tom Bartol 2, Terrence Sejnowski 2 and Eric Mjolsness 1
1 Department of Computer Science, University of California Irvine, CA 92697, USA
2 Salk Institute, La Jolla, CA, USA

E-mail: [email protected], [email protected], [email protected] and [email protected]

Keywords: model reduction, stochastic reaction networks, rule-based modeling, graph-constrained correlation dynamics, Boltzmann learning, CaMKII, spike timing dependent plasticity

Supplementary material for this article is available online

OPEN ACCESS
RECEIVED 31 October 2014
REVISED 3 April 2015
ACCEPTED FOR PUBLICATION 7 April 2015
PUBLISHED 18 June 2015

Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI. © 2015 IOP Publishing Ltd

Abstract
A stochastic reaction network model of Ca2+ dynamics in synapses (Pepke et al PLoS Comput. Biol. 6 e1000675) is expressed and simulated using rule-based reaction modeling notation in dynamical grammars and in MCell. The model tracks the response of calmodulin and CaMKII to calcium influx in synapses. Data from numerically intensive simulations is used to train a reduced model that, out of sample, correctly predicts the evolution of interaction parameters characterizing the instantaneous probability distribution over molecular states in the much larger fine-scale models. The novel model reduction method, ‘graph-constrained correlation dynamics’, requires a graph of plausible state variables and interactions as input. It parametrically optimizes a set of constant coefficients appearing in differential equations governing the time-varying interaction parameters that determine all correlations between variables in the reduced model at any time slice.

1. Introduction

Given a stochastic reaction network, even one specified by high-level ‘parameterized reactions’ or ‘rule-based’ notation [2–6], there is a corresponding chemical master equation (CME) for the evolution of probability distributions over all possible molecular states of the system. These states are ultimately described in terms of discrete-valued random variables. Unfortunately as the number of such random variables grows, the number of molecular states appearing directly in the CME grows exponentially. On the other hand even for a dynamical system that is nonlinear in its observable variables, the CME is a (very large) system of linear differential equations for time-evolving probabilities. The exponential explosion of state space size with number of random variables can often be bypassed in sampling-style simulation (such as the Gillespie stochastic simulation algorithm (SSA) [7] and its many variants), and also to a lesser extent for reaction rate inference, provided that enough trajectories are sampled to evaluate a required expected value. But the sampling approach requires a lot of computing power to sample enough trajectories, and also poses substantial obstacles for analysis.

The problem of state space growth is compounded in the case of rule-based stochastic models [2, 4–6], since in that case even the number of molecular species suffers an exponential growth in terms of natural problem size parameters such as the number of binding sites in a molecular complex. Then the state space described in the master equation grows doubly exponentially in such problem size parameters, and it can be hard to really understand the resulting stochastic dynamical system. And yet, molecular complexes with combinatorially many states (such as transcription complexes, signal transduction complexes, and allosteric enzyme complexes) are ubiquitous in biology, so the problem cannot simply be avoided. For example, the signal transduction cascade in response to calcium influx through N-methyl-D-aspartate receptors (NMDARs) in synapses, which plays a key role in synaptic plasticity in spatial learning and memory in mice, has such a combinatorially generated number of local states of molecular complexes involving calcium, calmodulin, CaMKII and phosphorylation sites, as will be described in section 3.1 below ([1, 8], and references therein). Each of the many possible states of each complex is itself a ‘molecular species’ with associated reaction channels.

To address these problems of model size and state space size, reduced models may have substantial advantages. In general, model reduction replaces a large model with a smaller and more tractable model that approximates the large model in some relevant aspect. The ideas of model ‘size’, ‘tractability’, ‘approximation’, and ‘relevant aspect’ can be defined in various ways. We will suggest one framework for these definitions in section 2.1 below. In sections 2.2 and 2.3 below we use this framework to introduce a new model reduction method for stochastic biochemical networks, which in our target application are also rule-based, though they need not be.

Our method can be viewed as a form of ‘moment closure’ method, as will be explained in section 2.4, which also contains further comparisons to related work. Compared to other methods of moment closure we seek a much more aggressive reduction in the model size, as counted by the number of degrees of freedom (chemical or biological variables with dynamically changing values) required even in a sampling approach, such as SSA, to the original unreduced biochemical model. This claim is substantiated in sections 2.4 and 3.4 below. Such a strategy may be appropriate to the eventual goal of finding usable ‘phenomenological’ but mechanistically well-founded approximate models of fine-scale subsystems to place within yet larger super-system models, for example placing calcium signal pathway reduced models within neuronal-level synaptic plasticity simulations, although we have not yet attempted such an application of this method. Unlike most but not all other moment closure approaches, we retain (approximations to) correlations of arbitrarily high order rather than truncating them to zero. Our approach applies naturally to the case of rule-based models. And perhaps most importantly from the biological point of view, it is based on a problem-specific graph of possible interactions between key system variables. Such a graph is a natural place to impose human biological expertise on the approximate model reduction method.

2. Theory

2.1. Model reduction criteria

Figure 1 illustrates our general setting. The basic idea is that the results of following the red arrows around from earlier to later observations, by way of a fine-scale predictive dynamical model as one would do in an ordinary simulation, should approximately agree with the results of following the green arrows around through a coarse-scale model instead. Since the coarse-scale model is smaller, following the green arrows could be cheaper computationally and also more amenable to human understanding of the dynamics. To define this possibility technically, figure 1 also includes mappings $M$ and $\hat{P}$ directly between the fine-scale and coarse-scale model state spaces.

All vectors and maps defined below are assumed to be defined in the sense of probability distributions, so that for example the fine-scale system state vector $S(t)$ is a distribution over all possible individual microscopic states $s$. In the deterministic limit that distribution could be a delta function that picks out one winning microstate. Likewise maps $M$, $\Delta_f[\Delta t]$, etc. take distribution vectors to distribution vectors. The forward time-evolution maps $\Delta_f[\Delta t]$ and $\Delta_c[\Delta t]$ are linear on probability distributions, and thus preserve mixtures, but in this paper we do not assume linearity of $M$ or the other maps introduced below.

Figure 1. Commutative diagram for model reduction. Fine-scale system $S$ with state $S(t)$ at time $t$ evolves according to fine-scale dynamics $\Delta_f[\Delta t]$ (lower horizontal arrow) to some new state $S(t+\Delta t)$ at time $t+\Delta t$ after (finite or infinitesimal) time interval $\Delta t$. Likewise, reduced or coarse-scale system $R$ with state $R(t)$ at time $t$ evolves under reduced coarse-scale dynamics $\Delta_c[\Delta t]$ (upper horizontal arrow). There is a model reduction or restriction map $M$ (vertical arrows), and a prolongation map $\hat{P}$ right-inverse to $M$. Comparison for approximation ($\approx$) may be made in some relevant third space $O$, which could specialize (as we assume below) to be the same as $S$ or $R$. The black arrow quadrilateral comprising two short paths from $S(t)$ to $R(t+\Delta t)$ corresponds to approximation equations (1) and (2). Approximation equations (3) and (4) correspond in their left-hand sides to the green highlighted path and in their right-hand sides to the red highlighted path. In principle multiple copies of this diagram can be composed, either horizontally or vertically.

Fine-scale and (reduced) coarse-scale dynamics are illustrated in figure 1. A model-reduction or ‘restriction’ map $M$ should ideally commute, at least approximately, with the time-evolution maps $\Delta_f[\Delta t]$ and $\Delta_c[\Delta t]$; thus for example

$$\Delta_c[\Delta t] \circ M \cdot S(t) \approx M \circ \Delta_f[\Delta t] \cdot S(t), \qquad (1)$$

for all $S(t)$, or more strongly for almost all possible $S$, which in operator space could be stated as

$$\Delta_c[\Delta t] \circ M \approx M \circ \Delta_f[\Delta t]. \qquad (2)$$

We will omit operator-space variants of the approximation statements below, but they should be clear if the same vector appears at the rightmost end of each side of the approximation. Here and in this section the sense of approximation $\approx$ has yet to be defined, but it requires aggregating some measure of error over microstates $s$, $r$, or $o$. For deterministic systems, sum-of-squares error over dynamical variables is plausible; for stochastic systems, an asymmetric Kullback–Leibler (K–L) divergence or relative entropy between two distributions over system states is plausible. The K–L divergence is useful when approximating probability distributions because (a) it measures the extra information contained in one distribution beyond what is in a second distribution, and (b) it takes its minimal value, zero, when the two distributions are equal almost everywhere.

Equations (1) and (2) are not entirely satisfactory since they provide no control over the space in which approximation comparisons are to be made. Alternatively to equation (1), and adopting terminology used in multigrid/multiscale algorithms [9, 10], one could introduce a ‘prolongation’ map $\hat{P}$ that is exactly or approximately right-inverse to $M$ (so that $M \circ \hat{P} = I$ or $M \circ \hat{P} \approx I$), and make the commutativity comparison in the fine-scale system space, $S$, rather than the coarse-scale space, $R$. But a more general scheme that encompasses both alternatives is to compare time-evolved states or probability distributions on states in a third space of significant ‘observables’, $O(t)$, as shown, using restriction maps $R_S: S \to O$ (and its right powerset inverse or prolongation map $P_S$ for which $R_S \circ P_S = I$) and $R_R: R \to O$ (and its right powerset inverse or prolongation map $P_R$ for which $R_R \circ P_R = I$ if space $R$ is not smaller than $O$, so that $R_R$ can be surjective) that predict the targeted observables based on the current state of each system. Then we seek

$$R_R \circ \Delta_c[\Delta t] \circ P_R \cdot O(t) \approx R_S \circ \Delta_f[\Delta t] \circ P_S \cdot O(t), \qquad (3)$$

as illustrated by the red and green three-arrow paths in figure 1. This, or the corresponding operator statement in the $O$ space,

$$R_R \circ \Delta_c[\Delta t] \circ P_R \approx R_S \circ \Delta_f[\Delta t] \circ P_S, \qquad (4)$$

is our most general statement of the commutation condition.

If we initialize $O(t) = R_S \cdot S(t)$, and assume for consistency the triangular commutation relation $P_R \circ R_S = M$, and define the projection operator $\Pi_S = P_S \circ R_S$, then equation (3) becomes

$$R_R \circ \Delta_c[\Delta t] \circ M \cdot S(t) \approx R_S \circ \Delta_f[\Delta t] \circ \Pi_S \cdot S(t). \qquad (5)$$

Two special cases are salient for our computational experiments. In the special case $O = S$, which we will use, then $R_S = I$, $R_R = \hat{P}$, $P_S = I$ and $P_R = M$ (note $R_R$ and $P_R$ exchange roles so that $R_R \circ P_R \neq I$ but instead $P_R \circ R_R = M \circ \hat{P} = I$), and we deduce $\Pi_S = I$ and the foregoing condition becomes

$$\hat{P} \circ \Delta_c[\Delta t] \circ M \cdot S(t) \approx \Delta_f[\Delta t] \cdot S(t). \qquad (6)$$

And in the special case $O = R$, which we will also use, $R_R = I$, $R_S = M$, $P_R = I$, and $P_S = \hat{P}$, so equation (3) reverts to

$$\Delta_c[\Delta t] \cdot R(t) \approx M \circ \Delta_f[\Delta t] \circ \hat{P} \cdot R(t), \qquad (7)$$

which is closely related to (1).

In all cases some measure of difference or distance is required to define the approximation $\approx$; such a measure may operate directly on microstates $s, r, o$, or on probability distribution state vectors $S, R, O$ over these microstates, as we assume below. Particular definitions of $\approx$ will be made in section 3.3 (for $O = R$) and the appendix (for $O = S$). The foregoing considerations apply to any dynamical system, including stochastic, deterministic, and mixed stochastic/deterministic ones.
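To make the commutation condition concrete, the following minimal sketch (in Python, with an invented four-state generator $W$ and a two-state lumping map $M$; none of these matrices come from the paper's models) compares the two paths of equation (1) for a small master-equation system:

```python
# Minimal numerical check of the commutation condition (1)/(2) for a toy
# fine-scale master equation dp/dt = W.p and a lumping restriction map M.
# All names here (W, M, Wc) are illustrative, not from the paper's models.
import numpy as np
from scipy.linalg import expm

# Fine scale: 4 microstates; columns of W sum to zero (1.W = 0).
W = np.array([[-1.0,  0.5,  0.2,  0.0],
              [ 1.0, -1.5,  0.0,  0.3],
              [ 0.0,  1.0, -1.2,  0.7],
              [ 0.0,  0.0,  1.0, -1.0]])

# Restriction M lumps microstates {0,1} and {2,3} into two coarse states.
M = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])

# A candidate coarse generator Wc, fit crudely so that Wc.M approximates
# M.W; exact lumpability would make the commutation error zero.
Wc = (M @ W) @ np.linalg.pinv(M)

dt = 0.1
p0 = np.array([0.7, 0.1, 0.1, 0.1])      # fine-scale distribution S(t)
lhs = expm(Wc * dt) @ (M @ p0)           # Delta_c[dt] . M . S(t)
rhs = M @ (expm(W * dt) @ p0)            # M . Delta_f[dt] . S(t)
print("commutation error:", np.max(np.abs(lhs - rhs)))
```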

2.2. Fine- and coarse-scale dynamics

To apply the foregoing framework we need to define fine-scale dynamics, coarse-scale dynamics, observables, mappings between them, and a sense of approximation (‘≈’ in figure 1). As in equation (7) above, we will report on the results of taking $O = R$. The approximation metrics will be defined in section 3.3. We now define fine- and coarse-scale models.

For a master equation derived from a large fine-scale reaction network (whether rule-based or not) we seek reduced coarse-scale models in the form of a Boltzmann distribution over states at each time point, with successive time points linked by an artificial and approximating dynamics on the ‘coupling constant’ or interaction parameters (now time-varying rather than constant) appearing in the Boltzmann energy function formula. In machine learning terms, our prolongation map $\hat{P}$ is given at each instant in time by a probability distribution on fine-scale variables specified by a Markov random field (MRF) [11]. The MRF comprises a set of ‘clique potentials’ or ‘interaction potentials’. Each interaction potential is a function, usually a monomial or polynomial, of just a few random variables. Each such potential function is multiplied by a scalar interaction potential strength, which we will call an ‘interaction parameter’. The sum of all the potentials (including their interaction parameter multiplicative factors) yields the energy function in the Boltzmann distribution. Unlike the potentials, the energy function depends on all the random variables. The way we apply this standard apparatus is as follows: the MRF random variables are interpreted as the fine-scale variables, and the MRF interaction parameters are taken to be the coarse-scale model dynamical variables. Only these interaction parameters can vary with time, and in our model they will vary continuously in time. Otherwise, the structure of each interaction potential is fixed and independent of time. Thus, the energy function and the MRF model depend on time only through the interaction parameters, which are the coarse-scale dynamical variables.

Without the constraint between successive time points enforced by coarse-scale dynamics on the interaction parameters, the classic Boltzmann machine learning algorithm (BMLA) [12] can be used separately at each time point to optimize these unknown interaction parameters to fit samples drawn from many simulations of the full model. This learning algorithm optimizes the K–L divergence or relative entropy between sampled and modeled distributions, thereby defining a sense for the approximation relationship in section 2.1, but only for one instant in time. We slightly generalize the BMLA learning algorithm so that it allows for weight-sharing (components of the $\mu$ interaction parameter vector that are constrained, e.g., to be equal) and for polynomial interaction potentials of degree higher than two.
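As a small illustration of the BMLA step just described (a minimal sketch, not the generalized implementation used in the paper; the potentials, state space, and learning rate are placeholders), one gradient update of the interaction parameters $\mu$ on an enumerable state space looks like:

```python
# Sketch of one BMLA update at a single time slice, assuming a tiny
# enumerable state space of +/-1 variables. The paper's generalization
# additionally supports weight-sharing and degree > 2 potentials.
import itertools
import numpy as np

V = [lambda s: s[0], lambda s: s[0] * s[1]]      # example potentials V_alpha
states = list(itertools.product([-1, 1], repeat=2))

def model_moments(mu):
    # <V_alpha> under p~(s) = exp(-sum_a mu_a V_a(s)) / Z, by enumeration.
    w = np.array([np.exp(-sum(m * v(s) for m, v in zip(mu, V)))
                  for s in states])
    p = w / w.sum()
    return np.array([sum(pi * v(s) for pi, s in zip(p, states)) for v in V])

def bmla_step(mu, data_moments, lr=0.1):
    # Descend the K-L divergence: with the convention p~ ~ exp(-sum mu.V),
    # the gradient is <V>_data - <V>_model, so this step drives the model
    # moments toward the moments sampled from the fine-scale simulations.
    return mu - lr * (data_moments - model_moments(mu))
```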

Coupling many such interaction-parameter inference problems together by insisting that the inferred interaction parameters $\mu$ all evolve according to imposed ordinary differential equation (ODE) dynamics, with a further set of learnable model-specifying meta-parameters $\theta$ defined in section 2.3 below, results in the graph-constrained correlation dynamics (GCCD) method presented here and in [13].

A large space of stochastic dynamical systems $(S, \Delta_f[\Delta t])$, which can specialize to deterministic ones, is specified by the master equation governing (possibly singular) probability distributions $p(s, t)$:

$$\frac{\mathrm{d}p(t)}{\mathrm{d}t} = W \cdot p(t), \qquad (8)$$

where $W$ is some linear operator acting on the state space of all distributions $p$ over $S$ and obeying conservation of probability, $\mathbf{1} \cdot W = 0$. Even though the master equation is linear, its effect on moments such as $\langle s_i(t) \rangle_p$ may be highly nonlinear.

For stochastic chemical kinetics the master equation specializes to the CME, which can be written:

$$\frac{\mathrm{d}}{\mathrm{d}t}\, p([n_i], t) = \sum_r k_r \left[ \left( \prod_j \left( n_j - S_j^{(r)} \right)^{\left( m_j^{(r)} \right)} \right) p\left( \left[ n_i - S_i^{(r)} \right], t \right) - \left( \prod_j n_j^{\left( m_j^{(r)} \right)} \right) p([n_i], t) \right], \qquad (9)$$

where $[n_i]$ is the vector of numbers $n_i$ of molecules of each type $i$; also in each reaction $r$, the stoichiometry of the reaction is defined by the following integers: $m_j^{(r)}$ copies of molecule $j$ are destroyed and $n_j^{(r)}$ are created, resulting in a net change of $S_j^{(r)} = n_j^{(r)} - m_j^{(r)}$ in the number of molecules of type $j$; also the notation $n^{(m)}$ means the falling factorial $n!/(n-m)!$, and $k_r$ is the reaction rate for reaction number $r$. This fully defines the fine-scale stochastic system for a mass-action chemical reaction network. Using the same notation, the chemical reaction network itself may be expressed as:

$$\forall r:\quad \sum_j m_j^{(r)} A_j \;\xrightarrow{\;k_r\;}\; \sum_i n_i^{(r)} A_i, \qquad (10)$$

where $A_i$ represents reacting chemical species number $i$ and the sums are to be interpreted chemically rather than mathematically.
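As a concrete companion to equations (9) and (10), here is a minimal sketch of the Gillespie SSA mentioned in section 1, for a small invented mass-action network, using the falling-factorial propensities $k_r \prod_j n_j^{(m_j^{(r)})}$:

```python
# Minimal Gillespie SSA for a mass-action network in the notation of
# equation (10). The two-reaction example network is illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def falling(n, m):
    # falling factorial n^(m) = n (n-1) ... (n-m+1) = n!/(n-m)!
    out = 1
    for i in range(m):
        out *= max(n - i, 0)
    return out

def ssa(n0, reactants, net_change, k, t_end):
    """n0: initial counts; reactants[r][j] = m_j^(r); net_change[r][j] = S_j^(r)."""
    t, n = 0.0, np.array(n0)
    while True:
        # propensity of reaction r: k_r * prod_j n_j^(m_j^(r))
        a = np.array([k[r] * np.prod([falling(n[j], m)
                                      for j, m in enumerate(reactants[r])])
                      for r in range(len(k))])
        a0 = a.sum()
        if a0 == 0:
            break
        t += rng.exponential(1.0 / a0)    # waiting time to the next reaction
        if t > t_end:
            break
        r = rng.choice(len(k), p=a / a0)  # which reaction fires
        n = n + net_change[r]
    return n

# Example: A + B -> C with k = 0.01, and C -> A + B with k = 0.1.
print(ssa([50, 40, 0],
          reactants=[[1, 1, 0], [0, 0, 1]],
          net_change=[np.array([-1, -1, 1]), np.array([1, 1, -1])],
          k=[0.01, 0.1], t_end=1.0))
```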

The Plenum implementation of dynamical grammars [2, 4] and the MCell Monte Carlo simulation software [3] can be used to express rule-based models and thereby to concisely define combinatorially many elementary reactants and reactions for molecular complexes such as CaMKII (which will be introduced in section 3.1 below), and to efficiently simulate them. In addition, spatial diffusion processes can be added. Plenum uses computer algebra to express high-level biological models, including those in which the number of compartments varies dynamically and hybrid stochastic/ODE systems. MCell has strong stochastic spatial capabilities and has been used extensively for synapse and neuronal simulations.

For coarse-scale approximate models, if we assume that the state space of the reduced model can be described as the product of state spaces for a fixed set of variables $r = \{\mu_\alpha\}$, then we may consider coarse-scale dynamical systems that through prolongation $\hat{P}$ induce an instantaneous fine-scale probability distribution $\tilde{p}(s, t)$ defined by some Boltzmann distribution

$$\tilde{p}(s, t) = \exp\left[ -\sum_\alpha \mu_\alpha(t)\, V_\alpha(s) \right] \Big/\, Z(\mu(t)), \qquad (11)$$

where $Z(\mu)$ normalizes $\tilde{p}$. This formula separates the time-evolution (which can only affect interaction parameters $\mu_\alpha$) from the correlation-controlling structure of interactions $V_\alpha$. If there are as many values of $\alpha$ as elements in the full state space of $s$, then any distribution can be described, but generally we choose a far sparser set of interaction terms. In general equation (11) has nonzero moments of all orders, though only a few moments $-\partial \log Z[\mu]/\partial \mu_\alpha = \langle V_\alpha(s) \rangle$ can be controlled independently by varying the $\mu_\alpha$ interaction parameters. This control would be exercised e.g. when one derives equation (11) by maximizing entropy subject to equality constraints on these moments $\langle V_\alpha(s) \rangle$ and on total probability. All other moments (which effectively have $\mu_{\alpha'} = 0$) would fall where they may, following the principle of constrained maximum entropy obeyed by the Boltzmann distribution.
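The controllable-moment identity $\langle V_\alpha(s) \rangle = -\partial \log Z[\mu]/\partial \mu_\alpha$ can be checked numerically; this sketch (with an arbitrary three-spin state space and made-up potentials, not the CaMKII graph) verifies it by finite differences:

```python
# Numerical check of <V_alpha> = -d(log Z)/d(mu_alpha) for the Boltzmann
# form (11) on a tiny illustrative state space of three +/-1 spins.
import itertools
import numpy as np

V = [lambda s: s[0], lambda s: s[1] * s[2], lambda s: s[0] * s[1] * s[2]]
states = list(itertools.product([-1, 1], repeat=3))

def logZ(mu):
    return np.log(sum(np.exp(-sum(m * v(s) for m, v in zip(mu, V)))
                      for s in states))

def moments(mu):
    lz = logZ(mu)
    p = [np.exp(-sum(m * v(s) for m, v in zip(mu, V)) - lz) for s in states]
    return np.array([sum(pi * v(s) for pi, s in zip(p, states)) for v in V])

mu, eps = np.array([0.3, -0.5, 0.1]), 1e-6
grad = np.array([(logZ(mu + eps * np.eye(3)[a]) - logZ(mu - eps * np.eye(3)[a]))
                 / (2 * eps) for a in range(3)])
print(np.allclose(moments(mu), -grad, atol=1e-5))   # True
```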

The essential information about the coarse-scale dynamics is contained in equation (11) above and equation (12) in section 2.3 below. In this setting, prolongation $\hat{P}$ from coarse to fine is obtained by sampling from the Boltzmann distribution $\tilde{p}(s, t \mid \mu)$. The model reduction map $M$ will be defined by statistical inference, from a sample of $S$ to $\mu$. Unlike the time-evolution maps $\Delta_f[\Delta t]$ and $\Delta_c[\Delta t]$, neither $M$ nor $\hat{P}$ must necessarily be linear on distributions, and in the special case of optimization algorithms such as maximum likelihood inference, maximum a posteriori inference, and BMLA, $M$ would be nonlinear due to its optimization of an objective function that is not quadratic in both $p$ and $\tilde{p}$. In sections 2.3 and 3 we will specialize and then apply the theory to stochastic biochemical networks.

2.3. Basis functions, smoothing, test cases

To define the coarse-scale stochastic model in terms of the time-evolving Boltzmann distribution of equation (11), we hypothesize that even though any particular sample of the stochastic nonlinear system will in general undergo discontinuous evolution, the probability distribution governing the ensemble of such samples is likely to evolve continuously in time (as does the master equation itself) even when projected down to distributions described by the statistical interaction parameters $\mu$. We therefore further hypothesize continuous and deterministic ODE dynamics for the $\mu(t)$ interaction parameters:

$$\frac{\mathrm{d}\mu_\alpha}{\mathrm{d}t}(t) = f_\alpha(\mu(t)) = \sum_A \theta_{\alpha A}\, f_A(\mu(t)), \qquad (12)$$

which is linear in new trainable model meta-parameters $\theta_{\alpha A}$ (referred to below as ‘model parameters’, and which must be distinguished from the interaction parameters $\mu$) that are constant in time, unlike the time-varying interaction parameters $\mu$. For basis functions $f_A(\mu)$ we use bases that arose in elementary solvable examples (two- and four-state binding models mentioned at the end of this subsection and described in [13]):

$$f_A(\mu) \in \bigcup_{\alpha, \beta} \left\{ 1,\; \mathrm{e}^{c \mu_\alpha},\; \mathrm{e}^{-(\mu_\alpha)^2},\; \mu_\alpha^{\,k},\; \mu_\alpha \mathrm{e}^{-(\mu_\beta)^2},\; \mu_\alpha \mu_\beta,\; \mathrm{e}^{-(\mu_\alpha - \mu_\beta)^2},\; \mathrm{e}^{-(\mu_\alpha + \mu_\beta)^2} \right\}, \qquad (13)$$

for $k \in \{1, \ldots, 5\}$ and $c \in \{-3, \ldots, +3\}$. Machine learning model selection algorithms (such as the ‘lasso’ algorithm of section 3.3 below) can be used to encourage sparsity in the use of bases, i.e. in the matrix $\theta$, which in turn favors good generalization performance. Many other forms for trainable ODE dynamics could be used in this ‘system identification’ subproblem, such as (nonlinear in $\theta$) neural networks.
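A sketch of the resulting dynamics of equation (12), linear in $\theta$, with a small illustrative basis library standing in for the set (13) (the basis choices, shapes, and step size here are placeholders):

```python
# Sketch of the ODE right-hand side d(mu_alpha)/dt = sum_A theta[alpha,A] f_A(mu)
# of equation (12), with an illustrative basis library in the spirit of (13).
import numpy as np

def basis_library(mu):
    """Evaluate basis functions f_A at the current interaction parameters."""
    feats = [1.0]
    feats += [m ** k for m in mu for k in (1, 2)]           # polynomial terms
    feats += [np.exp(c * m) for m in mu for c in (-1, 1)]   # exponential terms
    feats += [mu[a] * mu[b] for a in range(len(mu))
              for b in range(a + 1, len(mu))]               # pairwise products
    return np.array(feats)

def mu_dot(mu, theta):
    # theta has shape (n_mu, n_basis); equation (12) is linear in theta.
    return theta @ basis_library(mu)

# Forward-Euler integration of the coarse-scale dynamics (illustrative only):
mu = np.zeros(8)
theta = np.zeros((8, basis_library(mu).size))
for _ in range(100):
    mu = mu + 1e-3 * mu_dot(mu, theta)
```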

Like the Markovian equation (8), the deterministic equation (12) is differential in time, so that the evolution of the coarse-scale dynamical interaction parameters $\mu$ depends only on their state at the current time and not directly on their state at earlier times. However, as remarked in [14], many consistent non-Markovian stochastic processes can be obtained from Markovian ones by integrating out extra degrees of freedom not otherwise needed. Similar phenomena obtain for differential equations. In GCCD it may be possible to do this by adding extra ‘hidden’ interaction parameters and/or extra random variables to the GCCD graph.

As a postprocessing step, the BMLA-learned trajectories of $\mu_\alpha(t)$ interaction parameters could be smoothed in time $t$ by convolution with a Gaussian in $t$ and then differentiated with respect to time to get $\mathrm{d}\mu/\mathrm{d}t$; what we actually do is the mathematically equivalent operation of convolving with the analytic derivative of a Gaussian.
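For example, with SciPy's one-dimensional Gaussian filter (an assumed implementation choice; the paper does not name its software for this step), the smoothed derivative is obtained in one pass:

```python
# Convolving a noisy mu_alpha(t) series with the derivative of a Gaussian,
# equivalent to Gaussian smoothing followed by differentiation in time.
# The sampling interval, sigma, and synthetic trace are illustrative.
import numpy as np
from scipy.ndimage import gaussian_filter1d

dt = 1e-3                                  # sampling interval
t = np.arange(0, 1, dt)
mu_traj = np.sin(2 * np.pi * 4 * t) + 0.05 * np.random.randn(t.size)

# order=1 convolves with the first derivative of a Gaussian kernel;
# dividing by dt converts per-sample differences to per-unit-time rates.
dmu_dt = gaussian_filter1d(mu_traj, sigma=20, order=1) / dt
```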

The solvable examples used to derive the basis functions in equation (13) were: (1) a two-state binding site model in which ligand binds to and unbinds from a site, and (2) a four-state, two-site cooperative binding model, both obeying detailed balance. The K–L divergence minimization algorithm derived in the appendix solved both of these problems correctly to high accuracy. Unfortunately, with these basis functions, the GCCD K–L divergence minimization algorithm exhibited numerical instability and failure to converge on the realistic CaMKII problem outlined below. It is possible that this problem could be solved by variations such as more extensive stacking (defined in the appendix), which would allow the use of more training data, or a different form for the ODEs, such as different basis functions and/or ODEs nonlinear in $\theta$. In particular the ODE right-hand sides could take the mathematical form of trainable nonlinear neural networks. Multilayer neural networks with a final linear layer would generalize equation (12) to include trainable basis functions. In section 3 below we report on the results of a different strategy, which is to optimize approximation in the $O = R$ or $\mu$ space (as in equation (1) or (7)) rather than the $O = S$ space (as in equation (6)).

2.4. Previous work

If we multiply the appropriate master equation (equation (9) above) by monomials in key observables such as numbers of molecules of selected species in a chemical reaction network, and then sum over all states, we find ODEs for the time-evolution of various moments of the distribution over states. Unfortunately these equations do not close: the time derivative of lower-degree moments depends on the present value of higher-degree moments recursively, generating a countable infinity of coupled differential equations.
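A standard one-reaction illustration of this non-closure (not drawn from the paper's models): for the dimerization reaction $2A \xrightarrow{k} B$, multiplying equation (9) by $n_A$ and summing over states gives

$$\frac{\mathrm{d}\langle n_A \rangle}{\mathrm{d}t} = -2k \left\langle n_A (n_A - 1) \right\rangle = -2k \left( \langle n_A^2 \rangle - \langle n_A \rangle \right),$$

so the rate of the first moment depends on the second moment, whose own equation involves the third, and so on; a closure rule such as the mean-field replacement $\langle n_A^2 \rangle \to \langle n_A \rangle^2$ truncates this hierarchy.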

The goal of ‘moment closure’ methods [15–27] is to obtain explicit though approximate dynamics for some finite subset of the first-order moments $C_i = \langle s_i \rangle_{p(t)}$, the second-order moments $C_{ij} = \langle s_i s_j \rangle_{p(t)}$, and higher moments $C_{i_1 \cdots i_k} = \langle s_{i_1} \cdots s_{i_k} \rangle_{p(t)}$ of a collection of random variables $s_i$ under some dynamics of their joint probability distribution $p(\vec{s}, t)$. Many approximate moment closure schemes have been developed, starting from $k = 1$ mean field theory (systematically replacing $\langle s_i s_j \rangle_{p(t)}$ with a function $\langle s_i \rangle_{p(t)} \langle s_j \rangle_{p(t)}$ of the first-order moments, as would be correct for independent distributions), from which one recovers ordinary deterministic chemical kinetics, and escalating to second-order ($k = 2$) moment closures that consistently keep track only of means and variances (as would be correct for a Gaussian joint distribution) in chemical reaction networks [15, 16], or in population biology reaction–diffusion models [17] explicitly, or by means of the Fokker–Planck equation [18] or stochastic differential equations such as the chemical Langevin equation [19], which may sometimes be further reduced [20].

A slightly higher order scheme is the Kirkwood superposition approximation, which retains $\langle s_i s_j \rangle_{p(t)}$ ($k = 2$) but approximates triple correlations ($k = 3$) by a function of second-order correlations; it is derivable from a constrained maximum-entropy sense of approximation $\approx$ [21, 22], and has been used in multicellular stochastic modeling [23]. Fully dynamic higher-order cutoffs to the moment hierarchy for $k > 2$ include setting higher cumulants to zero [24], dropping higher-order terms from the Kramers–Moyal expansion [25], using moment closure functions that would be correct for log-normal rather than Gaussian distributions [26], and ‘equation-free’ moment closure [27] by sparingly invoking fine-scale simulations.

Each of these moment closure methods has the character, when compared to a sampling algorithm, of first exponentially expanding the space in which dynamics are formulated using the master equation, and then trying to cut the exponentially large space back down to size. Typical results start from a small reaction network with $n < 10$ molecular species (and thus chemical/biological degrees of freedom if there is a single well-stirred compartment), and produce a more efficient algorithm for determining low-order moment trajectories for a possibly reduced model of between about $n/2$ and $n$ molecular species, yielding model size reductions on the order of a factor of 1 to 2. The initial combinatorial explosion is not fully mitigated. This is an unpromising route if the goal is to find a large model reduction beyond what one already had at the fine scale (though not an impossible one, due to the need to run sampling simulations many times). We are proposing a different strategy for moment closure which more naturally results in model reduction with fewer chemical/biological degrees of freedom. From the moment closure point of view, what we are proposing is an arbitrary-order method particularly suited to approximating the CME and possibly related master equations, by a time-dependent variant of a Boltzmann distribution.

Additional model reductions for stochastic chemical kinetics, other than moment closure methods, include the classic strategy of using separation of time scales to eliminate fast degrees of freedom as carried out in e.g. the quasi-steady state approximation [28], in adiabatic coarse-graining [29], and with power law scaling of time with respect to an overall problem size parameter, differentially for different subsets of molecular species [30]. Another strategy for molecular species with small expected population sizes is the Finite State Projection method, which proposes an adaptive truncation of the state space followed by analytic solution or bounding of the (exponentially big, were it not truncated) master equation [31]. Other reaction network model reduction methods are restricted to deterministic models [32, 33], including a reduction from 42 to 29 molecular species [34]. The most comparable method in terms of problem size may be [35], which like GCCD applies to rule-based reaction networks. We will quantitatively compare degrees of model reduction of these methods to GCCD in section 3.4.

As in the case of moment closure methods, all of these methods have advantages and interesting ideas, but none that we know of are as yet in the same class of radical model size reduction as GCCD, in terms of the fraction of molecular species retained after reduction, in a stochastic biochemical network model.

3. Computational experiments

Molecular complexes in general, and signal transduction complexes in particular, often have state space explosions that pose problems for simulation and modeling. One such macromolecular complex is Ca2+/calmodulin-dependent protein kinase II (CaMKII) in synapses, which is activated when it binds to multiple calmodulin (CaM) molecules that in turn bind to multiple calcium ions. These processes trigger a cascade of reactions in a signal transduction pathway following Ca2+ influx in response to activation of ligand-gated NMDARs, supporting memory formation and learning.

We applied GCCD to this system as modeled by [1] and as simulated by the Plenum implementation of dynamical grammars [2, 4], and also in a much larger spatial simulation using MCell [3].

3.1. CaMKII system

The synaptic signal transduction pathway that starts with calcium ion (Ca2+) influx through voltage-dependent calcium channels and NMDARs leads ultimately to functional consequences including long-term potentiation (LTP), long-term depression, and spike-timing dependent plasticity underlying learning and memory formation in the hippocampus. The pathway as studied and modeled in [1] is structurally a bit involved. It begins with an externally imposed influx of calcium ion Ca2+. Calcium ions bind to the calmodulin protein (CaM), which has four calcium-binding sites, two at the N-terminal end and two at the C-terminal end. CaM in any state of calcium-loading can then bind to unphosphorylated CaMKII monomer (which has one phosphorylation site relevant for activation). However, the binding and unbinding rates for calcium ion to CaM depend on the state of the other binding sites of the CaM protein, and also on whether or not that CaM is bound to CaMKII. Likewise the binding and unbinding rates for CaM to unphosphorylated CaMKII monomer depend on the state of CaM. Two CaMKII monomers, each loaded with CaM in any calcium-binding state, but at most one of which is phosphorylated, may then dimerize (again with state-dependent rates). Dimers may phosphorylate one of their constituent subunits and promptly dissociate; autophosphorylated monomer CaMKII is taken to be the pathway output. Our goal is to produce a reduced stochastic model simplifying the structure of this fine-scale model as formulated in the MCell and Plenum models (model files in supplementary information) that aim to implement stochastic versions of the reaction network of [1].

Many subsequent stages of the biological pathway are thereby omitted, notably the formation of a CaMKII holoenzyme comprising a ring of six dimers in a dodecameric complex. This holoenzyme structure implies an even greater combinatorial explosion of states that poses a future modeling challenge, which may best be met by aggressive model reduction methods such as the GCCD method proposed here. Further downstream components of the pathway beyond CaM and the CaMKII holoenzyme are outlined in [8]. However we leave such explorations, which could aim to extract novel biological consequences from combinatorially large stochastic reaction network models of the NMDA receptor pathway by applying the GCCD model reduction technique, to future research.

Figure 2. Markov random field or Boltzmann distribution model of interaction degree 3 assumed for the CaMKII model. The diagram shows the graph-structure of the interaction terms $\mu_\alpha V_{\alpha\gamma}$ (hexagons) of degrees 1, 2, and 3 assumed between all the binary-valued random variables (colored labelled circles) described in the text. The left and right sides of the graph correspond to two CaMKII subunits that may dimerize (bottom variable), each of which may be phosphorylated (third row of variables) and/or bind a calmodulin (second row of variables), which in turn may bind calcium at its N-terminus or C-terminus and at site 1 or 2 at each end (first row). Omitted from the figure are extra factors that ensure each binding site is occupied by either zero or one molecule and not more. This figure comprises a template for the labelled graph of all random variables and their interactions in this application of ‘graph-constrained correlation dynamics’.

3.2. Boltzmann machine preprocessing step

Figure 2 shows part of the assumed Boltzmann distribution model interaction graph, or MRF, of binary-valued random variables (circles) and monomial interaction potentials of degree 1, 2, and 3 (hexagons). The first row of variables represents the binding of calcium to calmodulin (CaM). With subscripts, these $\pm 1$-valued random variables are labelled $\mathrm{CaM}_{\mathrm{c\,or\,n},\,a,\,i}$. They are indexed by the C-terminus versus the N-terminus of the calmodulin protein (c or n), the numerical index of the binding site on that end ($a$ = 1 or 2), and the numerical index $i$ ($i$ = 0 or 1 illustrated; $i \in \{0, 1, 2\}$ used in Plenum simulations below) of a calmodulin molecule which may bind to a CaMKII subunit (indexed by $j$ = 0 or 1 illustrated; $j \in \{0, \ldots, 8\}$ used in Plenum simulations below). The second row of variables, $\mathrm{bound}_{i,j}$, records the state of binding of CaM to CaMKII subunits. The third row of variables, $\mathrm{phos}_j$, also written (in the notation of the MCell model) as $\mathrm{Kkp}_j$, records the binary phosphorylation state of each CaMKII subunit, and the fourth row of variables, $\mathrm{dimer}_{jj'}$, records whether or not two such subunits dimerize. Not shown are additional cardinality-fixing or winner-take-all interactions enforcing the constraints that (if it occurs) binding is an exclusive relationship between CaM molecules and CaMKII subunits, and likewise for dimerization between two CaMKII subunits.

Weight-sharing is an important strategy in machine learning, widely used to reduce the number of trainable parameters and thereby increase generalization power for a given amount of data. Our weight-sharing scheme shares interaction parameters $\mu_\alpha$ within categories of monomial interactions that seem likely, according to testable human intuition, to have similar interaction strengths if trained separately on far more data. We now introduce some notation for the weight-sharing. Let $\mathcal{I} = \{0, \ldots, \#\mathrm{CaM} - 1\}$ and $\mathcal{J} = \{0, \ldots, \#\mathrm{CaMKII\ subunits} - 1\}$. We have the following categories of $\pm 1$-valued random variables $s_I$:

$$\begin{aligned}
&\mathrm{CaM}_{\mathrm{c\,or\,n},\,a,\,i} && (\forall\, \mathrm{c\ or\ n} \in \{\mathrm{c}, \mathrm{n}\})(\forall a \in \{1, 2\})(\forall i \in \mathcal{I}),\\
&\mathrm{bound}_{i,j} && (\forall i \in \mathcal{I})(\forall j \in \mathcal{J}),\\
&\mathrm{Kkp}_{j} && (\forall j \in \mathcal{J}),\\
&\mathrm{dimer}_{jj'} && (\forall j < j',\; j, j' \in \mathcal{J}).
\end{aligned}$$

We have used the following eight categories of monomial interactions $s_I$, $s_I s_J$, or $s_I s_J s_K$ (all taking values in $\{\pm 1\}$):

$$\begin{aligned}
&\mathrm{CaM}_{\mathrm{c},a,i} && (\forall a \in \{1, 2\})(\forall i \in \mathcal{I}),\\
&\mathrm{CaM}_{\mathrm{n},a,i} && (\forall a \in \{1, 2\})(\forall i \in \mathcal{I}),\\
&\mathrm{CaM}_{\mathrm{c},1,i}\,\mathrm{CaM}_{\mathrm{c},2,i} && (\forall i \in \mathcal{I}),\\
&\mathrm{CaM}_{\mathrm{n},1,i}\,\mathrm{CaM}_{\mathrm{n},2,i} && (\forall i \in \mathcal{I}),\\
&\mathrm{bound}_{i,j}\,\mathrm{CaM}_{\mathrm{c},1,i}\,\mathrm{CaM}_{\mathrm{c},2,i} && (\forall i \in \mathcal{I})(\forall j \in \mathcal{J}),\\
&\mathrm{bound}_{i,j}\,\mathrm{CaM}_{\mathrm{n},1,i}\,\mathrm{CaM}_{\mathrm{n},2,i} && (\forall i \in \mathcal{I})(\forall j \in \mathcal{J}),\\
&\mathrm{Kkp}_{j}\,\mathrm{bound}_{i,j} && (\forall i \in \mathcal{I})(\forall j \in \mathcal{J}),\\
&\mathrm{Kkp}_{j}\,\mathrm{Kkp}_{j'}\,\mathrm{dimer}_{jj'} && (\forall j < j',\; j, j' \in \mathcal{J}).
\end{aligned}$$

We number these weight-sharing categories $\alpha \in \{1, \ldots, 8\}$. Different categories $\alpha$ have different numbers $n_\alpha$ of monomial interactions depending on which bound indices $a, i, j, j'$ they run over. We take the potential function $V_\alpha$ for each category to be the category-average of the constituent monomials $V_{\alpha,\gamma} = s_I$, $s_I s_J$, or $s_I s_J s_K$ given above, so that $V_\alpha = (1/n_\alpha) \sum_{\gamma=1}^{n_\alpha} V_{\alpha,\gamma}$.

The resulting Boltzmann distribution can be sampled by standard Monte Carlo methods such as Metropolis–Hastings or Gibbs sampling. We use the ‘Dependency Diagrams’ software [13] (availability described in supplementary information) to do this. A crucial wrinkle on our use of such sampling algorithms is that we have two different biological conditions under which they get used: external calcium influx can be ‘on’ or ‘off’. We train four different sets of coarse-scale dynamical model parameters $\theta$ as described in section 3.3 below, for four phases (early and late phases for calcium influx on and for calcium influx off), and for pulsatile or periodic calcium influx we cycle through the four trained (or partly trained) models, while preserving the interaction parameters $\mu$ through each instantaneous phase-switching event. Once trained, these four models predict dynamical responses to any other temporal pattern of calcium influx switching, including the use of periodic switching with frequencies not in the training set.
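A minimal single-flip Metropolis sketch for sampling the Boltzmann distribution (11) (generic placeholder potentials and step counts; the Dependency Diagrams software cited above is what was actually used):

```python
# Single-flip Metropolis sampling from p~(s) propto exp(-E(s)) over +/-1
# variables, where E(s) = sum_alpha mu_alpha V_alpha(s) as in equation (11).
import numpy as np

rng = np.random.default_rng(1)

def energy(s, mu, potentials):
    return sum(m * V(s) for m, V in zip(mu, potentials))

def metropolis(n_vars, mu, potentials, n_steps=10_000):
    s = rng.choice([-1, 1], size=n_vars)
    E = energy(s, mu, potentials)
    for _ in range(n_steps):
        i = rng.integers(n_vars)
        s[i] = -s[i]                       # propose flipping one variable
        E_new = energy(s, mu, potentials)
        if rng.random() < np.exp(min(0.0, E - E_new)):
            E = E_new                      # accept the flip
        else:
            s[i] = -s[i]                   # reject: undo the flip
    return s
```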

A logical alternative to this procedure would be to use just two phases for calcium on and off, and to add the binary calcium-influx variable into the GCCD graph of figure 2 with suitable connections to allow the switch variable to join in modulating some or all of the potentials $V_\alpha$. However, we were not able to get this somewhat cleaner approach to work numerically. Just switching between two phases (Ca2+ influx on versus off) rather than four phases, without modifying the GCCD graph, also produced less robust behavior.

For the MCell simulations below, the waveform for calcium influx was a pulse train or rectangular wave, with the ‘on’ duration being 16 ms. For the Plenum simulations we used the ‘alpha model’ theoretical calcium influx profile [37].

In figures 3 and 4, the inferred trajectories of the $\mu$ time-varying interaction parameters are shown for the non-spatial Plenum [2] model (figure 3) and the spatial MCell [3] model (figure 4). The corresponding moments $\langle V_\alpha \rangle \in [-1, 1]$ are shown in figures 6 and 7 of the supplementary information, where their relation to concentrations is discussed. From figures 3 and 4 it is evident that the inferred $\mu_\alpha$ interaction parameter trajectories are remarkably continuous in time, empirically justifying the assumption of ODE dynamics in equation (12).

Figure 3. Time series of eight BMLA-inferred $\mu_\alpha(t)$ interaction parameter statistics, from the simulation data generated by the Plenum model in section 7.4 of the supplementary information text. Seven successive periods of an 8 Hz pulse train (alpha model waveform [37]) of calcium influx are plotted. In each pulse period the $\mu$ dynamics switches through four phases: fast calcium-on (1.6 ms), slow calcium-on (14.4 ms), fast calcium-off (8 ms), slow calcium-off (the rest of the period), with a different set of trained weights $\theta$ for each phase, while maintaining continuity of $\mu$ values through the switching events. The legend shows the color of the plot line associated to each of eight potentials $V_\alpha$ indicated by the given triples of state variables. Time course traces are labelled by random-variable monomials following the indexing of figure 2, although the actual graph is larger due to the increased range of indices $a, i, j, j'$ as defined in the text. Weight-sharing allows many different potential-function monomials $V_{\alpha,\gamma}$, differing only by the value of their indices $\gamma$ (mapped to various combinations of $a, i, j, j'$ as defined in the text), to share weights $\mu_\alpha$. The corresponding moments $\langle V_\alpha(t) \rangle$ are plotted in figure 6, supplementary information text.

Figure 4. Time series of eight BMLA-inferred $\mu_\alpha(t)$ interaction parameter statistics, from the simulation data generated by the MCell model in section 7.4 of the supplementary information text, with an 8 Hz rectangular wave pulse train of Ca2+ influx. Other plotting details are as in figure 3. Results differ significantly from those of figure 3 because the MCell model is extended over space, includes diffusion of reactants, and models a larger volume with more molecules of each species. The corresponding moments $\langle V_\alpha(t) \rangle$ are plotted in figure 7, supplementary information text.

3.3. GCCD

The resulting time series (such as figures 3 and 4) are convolved with the analytic temporal derivative of a Gaussian filter in order to smoothly estimate the rates of change $\mathrm{d}\mu_\alpha/\mathrm{d}t$ of the interaction parameters $\mu_\alpha(t)$. Such temporal smoothing is useful since taking derivatives of a noisy time series tends to enhance the noise, and in the absence of smoothing that occurs for our time series [13]. The resulting smoothed time derivatives are fit to equation (12) using either (1) online minimization of the K–L divergence as outlined in the appendix, for which $O = S$, or (2) lasso-regularized linear regression [36], for which $O = R$, and which performs model selection on the bases of equation (13) as discussed there. Here we report numerical results for the second method, which also defines a meaning for the $\approx$ symbol in section 2.1 as the lasso-regularized (L1-regularized) sum of squared differences for the time derivatives of the $\mu_\alpha(t)$ statistical interaction parameters (summed over $\alpha$, discretized $t$, and over any set of initial conditions and/or other input conditions $c$ such as calcium influx frequency). Thus, we optimize the model parameters $\theta_{\alpha A}$ by minimizing the score

$$S(\theta) = \sum_{\alpha, t, c} \left( \left[ \frac{\mathrm{d}\mu_\alpha(t)}{\mathrm{d}t} \right]_{\mathrm{fit}} - \left[ \frac{\mathrm{d}\mu_\alpha(t)}{\mathrm{d}t} \right]_{\mathrm{BMLA}} \right)^{2}_{\mathrm{discr}} + \lambda \sum_{\alpha, A} \left| \theta_{\alpha A} \right|, \qquad (14)$$

(defining a sense of approximation $\approx$ as needed in section 2.1), which by equation (12) is equivalent to

$$S(\theta) = \sum_{\alpha, t, c} \left( \sum_A \theta_{\alpha A} f_A(\mu) - \left[ \frac{\mathrm{d}\mu_\alpha(t)}{\mathrm{d}t} \right]_{\mathrm{BMLA}} \right)^{2}_{\mathrm{discr}} + \lambda \sum_{\alpha, A} \left| \theta_{\alpha A} \right|, \qquad (15)$$

a form that is explicitly lasso-regularized least squares optimization of the trainable model parameters $\theta$. The single scalar hyperparameter $\lambda$ was set using leave-one-out cross validation, with each calcium influx spike held out in turn. 150 out of 253 model basis parameters were always rejected by the lasso algorithm, and the remaining 103 bases were used sparsely depending on the $\mu_\alpha$ derivative being fit, which of the four phases was being fit, and which BMLA experiment data set was being used.
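A sketch of the fit of equation (15) using scikit-learn's lasso (an assumed stand-in; the paper does not specify its solver), with `basis_library` as in the earlier sketch and illustrative array shapes:

```python
# Lasso-regularized least squares fit of theta in equation (15): regress the
# smoothed BMLA derivatives d(mu_alpha)/dt on the basis values f_A(mu(t)).
import numpy as np
from sklearn.linear_model import Lasso

def fit_theta(mu_traj, dmu_dt_bmla, basis_library, lam=1e-3):
    # mu_traj: (T, n_mu) BMLA trajectories; dmu_dt_bmla: (T, n_mu) smoothed
    # derivatives. One lasso regression per interaction parameter alpha.
    F = np.array([basis_library(mu) for mu in mu_traj])    # (T, n_basis)
    theta = []
    for alpha in range(dmu_dt_bmla.shape[1]):
        model = Lasso(alpha=lam, fit_intercept=False, max_iter=50_000)
        model.fit(F, dmu_dt_bmla[:, alpha])
        theta.append(model.coef_)       # L1 penalty zeroes out many bases
    return np.array(theta)              # (n_mu, n_basis) matrix theta
```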

The resulting time-evolution of ODE-constrained interaction parameters $\mu_\alpha(t)$, evaluated out-of-sample (i.e. using different data than was used for training the model parameters), was almost indistinguishable from the unconstrained values of optimal interaction parameters obtained by BMLA, as a function of time over several calcium influx cycles, as shown in figure 5.

Figure 5. Dynamics of $\mu$ for the coarse-scale GCCD model tested out-of-sample (red lines), plotted along with BMLA-inferred values of $\mu$ (other colors) from Plenum stochastic simulations of the Ca2+/CaM/CaMKII network, using the alpha model [37] for calcium influx. Panels A, B and C visually separate pairs of (model, data) curves that would otherwise overlap, for ease of visual comparison. A high degree of overlap between (model, data) curve pairs is evident. Quantitative comparison of each (model, data) pair is given in table 1. The simulation is out-of-sample in that the calcium influx bursts occur with frequency 8 Hz, though GCCD was trained at frequencies 2, 4, and 10 Hz.

Table 1. Normalized rms errors in figure 5.

Interaction parameter                   rms error       Normalized rms error
CaM(c,1,0)                              rms1 = 0.062    cvrms1 = 0.132
CaM(n,1,0)                              rms2 = 0.070    cvrms2 = 0.174
CaM(c,1,0) CaM(c,2,0)                   rms3 = 0.096    cvrms3 = 0.170
CaM(n,1,0) CaM(n,2,0)                   rms4 = 0.075    cvrms4 = 0.141
bound(0,0) CaM(c,1,0) CaM(c,2,0)        rms5 = 0.068    cvrms5 = 0.044
bound(0,0) CaM(n,1,0) CaM(n,2,0)        rms6 = 0.061    cvrms6 = 0.118
Kkp0 bound(0,0)                         rms7 = 0.090    cvrms7 = 0.133
Kkp0 Kkp1 dimer(0,1)                    rms8 = 0.044    cvrms8 = 0.023


Numerical errors in figure 5 are shown in table 1, computed as root mean squared (rms) error and also as normalized rms error, in which the normalization is done by finding the average of the absolute value of each target BMLA time course, and dividing both the target and corresponding prediction time course by this average absolute value before computing the rms error as usual.
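A short sketch of that normalization (variable names are illustrative):

```python
# Normalized rms error as in table 1: divide both the BMLA target trace and
# the GCCD prediction by the mean absolute value of the target, then take rms.
import numpy as np

def normalized_rms(target, prediction):
    scale = np.mean(np.abs(target))
    t, p = target / scale, prediction / scale
    return np.sqrt(np.mean((t - p) ** 2))
```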

The results show good out-of-sample quantitative agreement for all eight time series.

3.4. Discussion of results

In quantitative terms, the degree of model reduction obtained by GCCD in the CaMKII example is large. Eight dynamical variables (the interaction parameters $\mu$) suffice to predict key outputs such as the global degree of CaMKII phosphorylation, each of which can be computed from expectations of random variables $s$ in the Boltzmann distribution of equation (11). But the fine-scale set of dynamical variables is much larger. In the MCell simulations we may conservatively count it as the integer-valued populations of the following classes of molecular species: free Ca2+, free CaM ($3 \times 3 + 1 = 10$ species), monomeric CaMKII subunit which can bind CaM in any of its states and can also be phosphorylated ($3 \times 3 \times 2 = 18$ species), and dimerize if at most one subunit is phosphorylated ($9 \times 9 + 9 \times 10/2 = 126$ species; phosphorylated dimers dissociate before they can doubly phosphorylate), for a total of 155 species, each of which has a dynamical random variable, and therefore a total reduction from 155 to just eight dynamical variables, which is very large.

In one sense this reduction in the number of dynamical variables strongly understates the situation, because in the actual MCell simulation (though not in the Plenum simulations), every individual chemical species has its own independent three-dimensional position, which is also a dynamical random variable. Given that a typical molecular population excluding Ca2+ is 145 CaM + 385 CaMK = 530, and including each 24-unit Ca2+ pulse is still greater, and that the number of position degrees of freedom is three times greater (1590 or more), the reduction to just eight dynamical variables $\mu_1$ through $\mu_8$ as listed in figures 3–5 is even more remarkable.

Comparable figures for the (deliberately non-spatial and well-mixed) Plenum simulations are again 155 molecular species reduced down to eight. An intermediate step is the formulation of the graph in figure 2, which has just eight interaction parameters (associated with the hexagonal interaction nodes) due to weight sharing, but many more molecular binding variables (circular nodes in figure 2). For the Plenum model we can count these variables as follows: due to the smaller simulated volume than for the MCell model, the index $i$ for CaM and the index $j$ for CaMKII run from 0 to 2 and 0 to 8 respectively. If we replicate the circular nodes in the graph of figure 2 accordingly, there are $4 \times 3$, $3 \times 9$, $9$, and $9 \times 10/2$ variables respectively in rows 1–4 of the full version of the graph illustrated in figure 5, for a total of 93 fine-scale binary-valued variables in a highly structured pattern.

As stated in section 2.4, most current results for stochastic model reduction start from a small reaction network with a handful of chemical species, and produce a more efficient (sometimes much more efficient) modeling algorithm for a possibly reduced model, with model size reductions on the order of a factor of 1 to 2. The authors of [35], a stochastic and rule-based model reduction method as is GCCD, achieve a larger model reduction in an EGF signal transduction model from 2768 to 609 molecular species, for a 4.5× reduction factor or a 0.81-power relation of reduced to full model ($609 \simeq 2768^{0.809}$). We have demonstrated for GCCD at least 155 to 8, for a 19.5× reduction factor, or a 0.41-power relation ($8 \simeq 155^{0.412}$). This factor of almost 20 breaks decisively out of the pattern of model size reductions on the order of a factor of just 1 to 2 in the number of chemical or biological degrees of freedom. Exploring tradeoffs between accuracy of approximation and amount of model size reduction (whether measured in direct ratios, or powers, of the number of degrees of freedom before and after reduction) for increasingly large models would therefore seem to be an attractive topic for future work.

To what features of GCCD or the present problem do we owe these very substantial reductions in model size? Future work could disentangle the following seemingly relevant factors: (1) GCCD can be applied to rule-based models, which are highly structured in that (as shown in the Plenum model code in the supplementary information text) a small number of rules can combinatorially code for a much larger molecular species-level reaction network. Thus an underlying simplicity is available. (2) The use of weight-sharing may make it possible to exploit such underlying simplicity in the problem domain. (3) The machine learning components of GCCD (BMLA and the new procedures for determining GCCD model parameters $\theta$) generalize directly from specific simulation data rather than algebraically and in full generality from the mathematical form of the fine-scale model. So currently available computer power is used effectively. (4) The Boltzmann distribution is a natural form for prolongation from coarse to fine models since, by constrained entropy maximization, it doesn't add any more information than is given by constraints on whatever moments were chosen for use in the GCCD graph structure. (5) The graph structure of GCCD is a good mechanism for importing biological knowledge and expertise into the model reduction process.


4. Conclusion

We propose a nonlinear model reduction method particularly suited to approximating the CME for stochastic chemical reaction networks (including highly structured ones resulting from 'parameterized' or 'rule-based' reactions) by a time-dependent variant of a Boltzmann distribution. The resulting GCCD method can be an accurate nonlinear model reduction method for stochastic molecular reaction networks involving a combinatorial explosion of states, such as the CaMKII signal transduction complex that is essential to synaptic function, particularly in neuronal learning processes such as LTP. The GCCD method could be further developed in many directions, including application to model reduction for the master equation semantics of more challenging molecular complexes such as the CaMKII dodecamer, use of the resulting reduced models to extract novel biologically relevant predictions, and generalization of the method to yet more general reaction-like biological modeling formalisms capable of expressing multiscale models in developmental biology [4].

Acknowledgments

We wish to acknowledge many useful discussions with M Kennedy, S Pepke, Tamara Kinzer-Ursem, and D H Sharp. This work was supported by NIH grant R01 GM086883, and also supported in part by the United States Air Force under Contract No. FA8750-14-C-0011 under the DARPA PPAML program, by NIH grant R01 HD073179 to Ken Cho and EM, by the Leverhulme Trust, and by the hospitality of the Sainsbury Laboratory, Cambridge University. TB and TS were supported by NIH grant P41-GM103712 (National Center for Multiscale Modeling of Biological Systems), by NIH grant MH079076, and by the Howard Hughes Medical Institute.

Appendix A. Online learning derivations

We begin with the definition of the KL divergence
\[
\mathrm{KL}(\tilde p \,\|\, p) = -\int \tilde p\,\log\!\left(\frac{p}{\tilde p}\right)\mathrm{d}x \tag{16}
\]
(or equivalently, $\int \tilde p\,\log(\tilde p/p)\,\mathrm{d}x$). We could equally well begin with the divergence in the other direction, as $-\int p\,\log(\tilde p/p)\,\mathrm{d}x$; the analogous derivation in that case is similar to what follows (and is performed in appendix A of [13]), but in our experience the resulting training algorithm produced slightly less reliable results.

Our approach will be to compute the derivative $\partial\,\mathrm{KL}/\partial\mu_\alpha$, then take the time derivative of this term, and minimize that. Minimization of $\partial\,\mathrm{KL}/\partial\mu_\alpha$ corresponds to matching the distribution $\tilde p$ to $p$ at an initial time point, and we will need to take for granted the ability to do this well once, as an initialization. Then, if we have done a good job of that, keeping the change in this term at zero (setting the time derivative of this term equal to zero and solving for the parameters of a GCCD model) will track the optimal solution as the two distributions change in time.
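In symbols, and as a restatement of the strategy just described, the two conditions are
\[
\left.\frac{\partial\,\mathrm{KL}}{\partial\mu_\alpha}\right|_{t=t_0} \approx 0
\quad \text{(one-time initialization, fitting } \tilde p \text{ to } p\text{)}, \qquad
\frac{\mathrm{d}}{\mathrm{d}t}\,\frac{\partial\,\mathrm{KL}}{\partial\mu_\alpha} = 0
\quad \text{(tracking condition, solved for the GCCD parameters } \theta\text{)}.
\]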

Though we have so far defined the variable $x$ to take values in a discrete space, we use integral notation throughout this derivation, as integration specializes to summation on a discrete domain, but the converse is not true.

The first few steps here are just pulling apart terms, starting from the definitions of the MRF, as follows:

\[
\mathrm{KL}(\mu(t)) = -\int \tilde p(x;\mu(t))\,\log\!\left(\frac{p(x;t)}{\tilde p(x;\mu(t))}\right)\mathrm{d}x,
\]
so
\[
\mathrm{KL}(\mu(t)) = -\int \tilde p(x;\mu(t))\left(\sum_{\beta=1}^{\mathrm{cliques}} \mu_\beta(t)\,V_\beta(x) + \log Z(\mu(t)) + \log p(x;t)\right)\mathrm{d}x.
\]

Now evaluate the derivative $\partial\left[\mathrm{KL}(\mu(t))\right]/\partial\mu_\alpha$.

Beginning with the product rule, we have
\begin{align}
\frac{\partial\,\mathrm{KL}(\mu(t))}{\partial\mu_\alpha}
={}& -\int \frac{\partial \tilde p(x;\mu(t))}{\partial\mu_\alpha}
\left(\sum_{\beta=1}^{\mathrm{cliques}} \mu_\beta(t)\,V_\beta(x) + \log Z(\mu(t)) + \log p(x;t)\right)\mathrm{d}x \tag{17}\\
&- \int \tilde p(x;\mu(t)) \tag{18}\\
&\quad\times \frac{\partial}{\partial\mu_\alpha}\left(\sum_{\beta=1}^{\mathrm{cliques}} \mu_\beta(t)\,V_\beta(x) + \log Z(\mu(t)) + \log p(x;t)\right)\mathrm{d}x. \tag{19}
\end{align}

Breaking this expression down, we begin by evaluating $\partial \tilde p(x;\mu(t))/\partial\mu_\alpha$:

\[
\frac{\partial \tilde p(x;\mu(t))}{\partial\mu_\alpha}
= \frac{\partial}{\partial\mu_\alpha}\,\frac{\mathrm{e}^{-\sum_\beta \mu_\beta V_\beta(x)}}{Z(\mu(t))}
= \tilde p(x)\left(\langle V_\alpha\rangle - V_\alpha(x)\right). \tag{20}
\]

Unless otherwise noted, all expectations $\langle \cdots \rangle$ are with respect to the distribution $\tilde p$.
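Equation (20) is easy to check numerically. The following minimal Python sketch (our own illustrative example, not the paper's code) compares the analytic gradient against a central finite difference on a toy four-state distribution with two made-up potentials:

import numpy as np

# Toy Boltzmann-form distribution p_tilde(x) = exp(-sum_b mu_b V_b(x)) / Z
# over four discrete states, with two arbitrary potentials V_0, V_1.
rng = np.random.default_rng(0)
V = rng.normal(size=(2, 4))                # V[b, x]: potential b evaluated at state x
mu = np.array([0.3, -0.7])

def p_tilde(mu):
    w = np.exp(-mu @ V)                    # unnormalized weights
    return w / w.sum()

p = p_tilde(mu)
# Analytic gradient from equation (20): dp/dmu_0 = p * (<V_0> - V_0(x))
grad_analytic = p * ((p @ V[0]) - V[0])
# Central finite difference in mu_0
eps = 1e-6
grad_numeric = (p_tilde(mu + [eps, 0]) - p_tilde(mu - [eps, 0])) / (2 * eps)
assert np.allclose(grad_analytic, grad_numeric, atol=1e-8)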

Plugging this result back into equation (19), and using $\partial \log p(x;t)/\partial\mu_\alpha = 0$ along with the evaluation of $\partial \log Z(\mu(t))/\partial\mu_\alpha$ as $-\langle V_\alpha\rangle$, we have


\begin{align}
\frac{\partial\,\mathrm{KL}(\mu(t))}{\partial\mu_\alpha}
={}& -\int \tilde p(x;\mu(t))\left(\langle V_\alpha\rangle - V_\alpha(x)\right)
\left(\sum_{\beta=1}^{\mathrm{cliques}} \mu_\beta(t)\,V_\beta(x) + \log Z(\mu(t)) + \log p(x;t)\right)\mathrm{d}x\notag\\
&- \int \tilde p(x;\mu(t))\left(V_\alpha(x) - \langle V_\alpha\rangle\right)\mathrm{d}x. \tag{21}
\end{align}

If we separate the second integral, we see that the two halves cancel:
\begin{align*}
\int \tilde p(x;\mu(t))\left(V_\alpha(x) - \langle V_\alpha\rangle\right)\mathrm{d}x
&= \int \tilde p(x;\mu(t))\,V_\alpha(x)\,\mathrm{d}x - \int \tilde p(x;\mu(t))\,\langle V_\alpha\rangle\,\mathrm{d}x\\
&= \int \tilde p(x;\mu(t))\,V_\alpha(x)\,\mathrm{d}x - \langle V_\alpha\rangle \int \tilde p(x;\mu(t))\,\mathrm{d}x\\
&= \langle V_\alpha\rangle - \langle V_\alpha\rangle = 0.
\end{align*}

We can now absorb the leading minus sign, leaving

\[
\frac{\partial\,\mathrm{KL}(\mu(t))}{\partial\mu_\alpha}
= \int \tilde p(x;\mu(t))\left(V_\alpha(x) - \langle V_\alpha\rangle\right)
\left(\sum_{\beta=1}^{\mathrm{cliques}} \mu_\beta(t)\,V_\beta(x) + \log Z(\mu(t)) + \log p(x;t)\right)\mathrm{d}x. \tag{22}
\]

Now we wish to find the derivative of equation (22) with respect to time. In order to accomplish this, we first name pieces of equation (22). Let
\[
A = \tilde p(x;\mu(t))\left(V_\alpha(x) - \langle V_\alpha\rangle\right),\qquad
B = \sum_{\beta=1}^{\mathrm{cliques}} \mu_\beta(t)\,V_\beta(x),\qquad
C = \log Z(\mu(t)),\qquad
D = \log p(x;t).
\]
Then we can write the desired derivative as

\[
\frac{\mathrm{d}}{\mathrm{d}t}\,\frac{\partial\,\mathrm{KL}(\mu(t))}{\partial\mu_\alpha}
= \int \left[A\,\frac{\partial B}{\partial t} + A\,\frac{\partial C}{\partial t} + A\,\frac{\partial D}{\partial t} + (B + C + D)\,\frac{\partial A}{\partial t}\right]\mathrm{d}x. \tag{23}
\]

Concentrating on the first two of these terms, $\int [A\,\partial B/\partial t + A\,\partial C/\partial t]\,\mathrm{d}x$, we see that $\partial B/\partial t = \sum_\beta (\partial\mu_\beta/\partial t)\,V_\beta$ and, using the chain rule, $\partial C/\partial t = \sum_\beta (\partial\mu_\beta/\partial t)(\partial \log Z/\partial\mu_\beta)$. Thus,

\begin{align*}
\int \left[A\,\frac{\partial B}{\partial t} + A\,\frac{\partial C}{\partial t}\right]\mathrm{d}x
&= \int A \left[\sum_\beta \frac{\partial\mu_\beta}{\partial t}\left(V_\beta(x) - \langle V_\beta\rangle\right)\right]\mathrm{d}x\\
&= \int \tilde p(x;\mu(t)) \sum_\beta \frac{\partial\mu_\beta}{\partial t}\left(V_\beta(x) - \langle V_\beta\rangle\right)\left(V_\alpha(x) - \langle V_\alpha\rangle\right)\mathrm{d}x.
\end{align*}

Shifting the integral inside the sum, we recognize the expression for the covariance between the functions $V_\alpha$ and $V_\beta$, where $\langle (x - \langle x\rangle)(y - \langle y\rangle)\rangle = \langle xy\rangle - \langle x\rangle\langle y\rangle = \mathrm{Cov}(x, y)$, so

\[
\int \left[A\,\frac{\partial B}{\partial t} + A\,\frac{\partial C}{\partial t}\right]\mathrm{d}x
= \sum_{\beta=1}^{\mathrm{cliques}} \frac{\partial\mu_\beta(t)}{\partial t}\,\mathrm{Cov}\!\left(V_\alpha, V_\beta\right)
= \sum_{\beta=1}^{\mathrm{cliques}} \frac{\partial\mu_\beta(t)}{\partial t}\left(\langle V_\alpha V_\beta\rangle - \langle V_\alpha\rangle\langle V_\beta\rangle\right). \tag{24}
\]

Turning our attention now to the fourth term of equation (23):
\[
\int \frac{\partial}{\partial t}\left[\tilde p(x;\mu(t))\left(V_\alpha(x) - \langle V_\alpha\rangle\right)\right]
\left(\sum_{\beta=1}^{\mathrm{cliques}} \mu_\beta(t)\,V_\beta(x) + \log Z(\mu(t)) + \log p(x;t)\right)\mathrm{d}x,
\]

applying the product and chain rules to equation (20), and noting that differentiating $\langle V_\alpha\rangle = \int \tilde p\,V_\alpha\,\mathrm{d}x$ under equation (20) gives $\partial \langle V_\alpha\rangle/\partial\mu_\beta = \langle V_\alpha\rangle\langle V_\beta\rangle - \langle V_\alpha V_\beta\rangle$, the time derivative of $A$, where $\partial A/\partial t = \partial\left[\tilde p(x;\mu(t))\left(V_\alpha(x) - \langle V_\alpha\rangle\right)\right]/\partial t$, is

\begin{align}
\frac{\partial A}{\partial t}
={}& \sum_\beta \frac{\partial\mu_\beta}{\partial t}\,\tilde p(x)\left(\langle V_\beta\rangle - V_\beta(x)\right)\left(V_\alpha(x) - \langle V_\alpha\rangle\right)
- \tilde p(x;\mu(t))\,\frac{\partial \langle V_\alpha\rangle}{\partial t}\notag\\
={}& -\tilde p(x;\mu(t)) \sum_\beta \frac{\partial\mu_\beta}{\partial t}
\left[\left(V_\beta(x) - \langle V_\beta\rangle\right)\left(V_\alpha(x) - \langle V_\alpha\rangle\right)
- \left(\langle V_\alpha V_\beta\rangle - \langle V_\alpha\rangle\langle V_\beta\rangle\right)\right]. \tag{25}
\end{align}

Note that the covariance between $V_\alpha$ and $V_\beta$ has appeared again, and that it is being subtracted from terms which have the same form as the terms inside a covariance. That is, integrating over the first part of equation (25) would again produce the covariance between $V_\alpha$ and $V_\beta$, times the derivative of $\mu$.

Terms such as $V_\beta(x) - \langle V_\beta\rangle$ recur regularly throughout this derivation. Therefore, we define a new notation. Let $\Delta X \equiv X - \langle X\rangle$. Then we can rewrite the covariance in equation (24) as

\[
\sum_{\beta=1}^{\mathrm{cliques}} \frac{\partial\mu_\beta(t)}{\partial t}\left(\langle V_\alpha V_\beta\rangle - \langle V_\alpha\rangle\langle V_\beta\rangle\right)
= \sum_{\beta=1}^{\mathrm{cliques}} \frac{\partial\mu_\beta(t)}{\partial t}\,\langle \Delta V_\alpha\,\Delta V_\beta\rangle.
\]

Additionally, equation (25) fits the same pattern, with
\[
X = \left(V_\beta(x) - \langle V_\beta\rangle\right)\left(V_\alpha(x) - \langle V_\alpha\rangle\right) = \Delta V_\alpha\,\Delta V_\beta.
\]

So, we may rewrite this as
\[
\frac{\partial A}{\partial t} = -\tilde p(x;\mu(t)) \sum_\beta \frac{\partial\mu_\beta}{\partial t}\,\Delta\!\left(\Delta V_\alpha\,\Delta V_\beta\right).
\]

Plugging this in and copying the definitions for $B$, $C$, and $D$, we have
\begin{align*}
\int (B + C + D)\,\frac{\partial A}{\partial t}\,\mathrm{d}x
={}& -\int \tilde p(x;\mu(t)) \sum_\beta \frac{\partial\mu_\beta}{\partial t}\,\Delta\!\left(\Delta V_\alpha\,\Delta V_\beta\right)
\left(\sum_{\gamma=1}^{\mathrm{cliques}} \mu_\gamma(t)\,V_\gamma(x)\right)\mathrm{d}x\\
&- \int \tilde p(x;\mu(t)) \sum_\beta \frac{\partial\mu_\beta}{\partial t}\,\Delta\!\left(\Delta V_\alpha\,\Delta V_\beta\right)\log Z(\mu(t))\,\mathrm{d}x\\
&- \int \tilde p(x;\mu(t)) \sum_\beta \frac{\partial\mu_\beta}{\partial t}\,\Delta\!\left(\Delta V_\alpha\,\Delta V_\beta\right)\log p(x;t)\,\mathrm{d}x.
\end{align*}

In the second of these terms, the factor $\log Z(\mu(t))$ is not a function of $x$, so it can be pulled outside the integral; what remains is $\int \tilde p\,\Delta(\Delta V_\alpha\,\Delta V_\beta)\,\mathrm{d}x = \langle \Delta(\Delta V_\alpha\,\Delta V_\beta)\rangle$, and the expectation of any $\Delta$ quantity vanishes by construction. So this term is zero, leaving

\begin{align}
\int (B + C + D)\,\frac{\partial A}{\partial t}\,\mathrm{d}x
={}& -\int \tilde p(x;\mu(t)) \sum_\beta \frac{\partial\mu_\beta}{\partial t}\,\Delta\!\left(\Delta V_\alpha\,\Delta V_\beta\right)
\left(\sum_{\gamma=1}^{\mathrm{cliques}} \mu_\gamma(t)\,V_\gamma(x)\right)\mathrm{d}x\notag\\
&- \int \tilde p(x;\mu(t)) \sum_\beta \frac{\partial\mu_\beta}{\partial t}\,\Delta\!\left(\Delta V_\alpha\,\Delta V_\beta\right)\log p(x;t)\,\mathrm{d}x. \tag{26}
\end{align}

Once again, we break apart equation (26) and consider the parts individually. Beginning with the first integral, we regroup the terms of the multiplication so that there is just one double sum over cliques, then move terms which do not depend on $x$ outside of the integral, and integrate. This leaves another expected value:

\begin{align}
-\int \tilde p(x;\mu(t)) \sum_\beta &\frac{\partial\mu_\beta}{\partial t}\,\Delta\!\left(\Delta V_\alpha\,\Delta V_\beta\right)
\left(\sum_{\gamma=1}^{\mathrm{cliques}} \mu_\gamma(t)\,V_\gamma(x)\right)\mathrm{d}x\notag\\
&= -\sum_{\beta=1}^{\mathrm{cliques}} \sum_{\gamma=1}^{\mathrm{cliques}} \mu_\gamma(t)\,\frac{\partial\mu_\beta(t)}{\partial t}
\left\langle V_\gamma\,\Delta\!\left(\Delta V_\alpha\,\Delta V_\beta\right)\right\rangle. \tag{27}
\end{align}

This expression belies some of the symmetry of this term. Note that, if $X = V_\gamma$ and $Y = \Delta V_\alpha\,\Delta V_\beta$, the innermost term is $\langle X\,\Delta Y\rangle$. From the definitions of $\Delta$ and expectation, this is equivalent to $\langle XY\rangle - \langle X\rangle\langle Y\rangle$, which once again is $\mathrm{Cov}(X, Y)$. Of course, covariance relationships are symmetric, so this is also $\mathrm{Cov}(Y, X)$, which from the preceding argument is $\langle Y\,\Delta X\rangle$. Thus

\[
\langle X\,\Delta Y\rangle = \mathrm{Cov}(X, Y) = \langle Y\,\Delta X\rangle. \tag{28}
\]

Applying this transformation to equation (27), we have finally

\[
-\sum_{\beta=1}^{\mathrm{cliques}} \sum_{\gamma=1}^{\mathrm{cliques}} \mu_\gamma(t)\,\frac{\partial\mu_\beta(t)}{\partial t}
\left\langle V_\gamma\,\Delta\!\left(\Delta V_\alpha\,\Delta V_\beta\right)\right\rangle
= -\sum_{\beta=1}^{\mathrm{cliques}} \sum_{\gamma=1}^{\mathrm{cliques}} \mu_\gamma(t)\,\frac{\partial\mu_\beta(t)}{\partial t}
\left\langle \Delta V_\alpha\,\Delta V_\beta\,\Delta V_\gamma\right\rangle. \tag{29}
\]
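Identity (28), on which the step from (27) to (29) rests, can also be verified numerically; a minimal Python sketch with an arbitrary toy distribution (illustrative only):

import numpy as np

# Check <X dY> = Cov(X, Y) = <Y dX>, where dZ = Z - <Z>.
rng = np.random.default_rng(1)
p = rng.random(6); p /= p.sum()            # toy probability vector over six states
X, Y = rng.normal(size=6), rng.normal(size=6)

E = lambda f: float(p @ f)                 # expectation under p
dX, dY = X - E(X), Y - E(Y)
lhs, cov, rhs = E(X * dY), E(X * Y) - E(X) * E(Y), E(Y * dX)
assert np.allclose([lhs, rhs], cov)        # all three quantities agree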

Turning to the second half of equation (26), the steps are similar, but the $V_\gamma$ terms are replaced with $\log p(x)$ terms:

\[
-\int \tilde p(x;\mu(t)) \sum_\beta \frac{\partial\mu_\beta}{\partial t}\,\Delta\!\left(\Delta V_\alpha\,\Delta V_\beta\right)\log p(x;t)\,\mathrm{d}x
= -\sum_{\beta=1}^{\mathrm{cliques}} \frac{\partial\mu_\beta(t)}{\partial t}
\left\langle \Delta\!\left(\Delta V_\alpha\,\Delta V_\beta\right)\log p(x;t)\right\rangle. \tag{30}
\]

The final piece of equation (23) is
\[
\int A\,\frac{\partial D}{\partial t}\,\mathrm{d}x
= \int \tilde p(x;\mu(t))\left(V_\alpha(x) - \langle V_\alpha\rangle\right)\frac{\partial \log p(x;t)}{\partial t}\,\mathrm{d}x
= \left\langle \Delta V_\alpha\,\frac{\partial \log p(x;t)}{\partial t}\right\rangle. \tag{31}
\]

Now that we have analyzed each of the parts of equation (23), we can set it equal to zero and move the terms with a sum over cliques across the equals sign. Then we have as a solution


\begin{align}
\left\langle \Delta V_\alpha\,\frac{\partial \log p(x;t)}{\partial t}\right\rangle
= \sum_{\beta=1}^{\mathrm{cliques}} \frac{\partial\mu_\beta(t)}{\partial t}
\Bigg(&\left\langle \Delta\!\left(\Delta V_\alpha\,\Delta V_\beta\right)\log p(x;t)\right\rangle
- \left\langle \Delta V_\alpha\,\Delta V_\beta\right\rangle\notag\\
&+ \sum_{\gamma=1}^{\mathrm{cliques}} \mu_\gamma(t)\left\langle \Delta V_\alpha\,\Delta V_\beta\,\Delta V_\gamma\right\rangle\Bigg). \tag{32}
\end{align}

Following the master equation $\partial p/\partial t = W \cdot p$ for the derivative of $p$, so that $\partial \log p(x;t)/\partial t = (W \cdot p(\cdot;t))(x)/p(x;t)$, and setting $\partial\mu_\beta(t)/\partial t \equiv f_\beta(\mu(t))$, this expression becomes

\[
\left\langle \Delta V_\alpha\,\frac{(W \cdot p(\cdot;t))(x)}{p(x;t)}\right\rangle
= \sum_{\beta=1}^{\mathrm{cliques}} f_\beta(\mu(t))
\Bigg(\left\langle \Delta\!\left(\Delta V_\alpha\,\Delta V_\beta\right)\log p(x;t)\right\rangle
- \left\langle \Delta V_\alpha\,\Delta V_\beta\right\rangle
+ \sum_{\gamma=1}^{\mathrm{cliques}} \mu_\gamma(t)\left\langle \Delta V_\alpha\,\Delta V_\beta\,\Delta V_\gamma\right\rangle\Bigg).
\]

Then, using equation (12) to define a linear form for $f_\beta$, we have

\[
\left\langle \Delta V_\alpha\,\frac{(W \cdot p(\cdot;t))(x)}{p(x;t)}\right\rangle
= \sum_{\beta=1}^{\mathrm{cliques}} \sum_{A=1}^{\mathrm{bases}} \theta_{\beta A}\,f_A(\mu(t))
\Bigg(\left\langle \Delta\!\left(\Delta V_\alpha\,\Delta V_\beta\right)\log p(x;t)\right\rangle
- \left\langle \Delta V_\alpha\,\Delta V_\beta\right\rangle
+ \sum_{\gamma=1}^{\mathrm{cliques}} \mu_\gamma(t)\left\langle \Delta V_\alpha\,\Delta V_\beta\,\Delta V_\gamma\right\rangle\Bigg).
\]

The $p$ expressions in the numerator and denominator may be evaluated using the BMLA-trained approximation of $p(x, t)$ at time $t$. Then, these expressions are finally in a form which can be evaluated during Monte Carlo simulations of $\tilde p$, resulting in an online learning algorithm in the spirit of BMLA itself, though more complicated. To this end we now define a vector $\mathbf{B}$ with entries indexed by $\alpha$,

\[
B_\alpha = \left\langle \Delta V_\alpha\,\frac{(W \cdot p(\cdot;t))(x)}{p(x;t)}\right\rangle, \tag{33}
\]

and a structured $\alpha$-by-$(\beta, A)$ matrix $\mathbf{A}$ with entries

\[
A_{\alpha,(\beta,A)} = f_A(\mu(t))
\Bigg(\left\langle \Delta\!\left(\Delta V_\alpha\,\Delta V_\beta\right)\log p(x;t)\right\rangle
- \left\langle \Delta V_\alpha\,\Delta V_\beta\right\rangle
+ \sum_{\gamma=1}^{\mathrm{cliques}} \mu_\gamma(t)\left\langle \Delta V_\alpha\,\Delta V_\beta\,\Delta V_\gamma\right\rangle\Bigg). \tag{34}
\]

Then $\mathbf{B} = \mathbf{A} \cdot \theta$, where the dot product is taken over the compound index $(\beta, A)$. Finally, by calculating values for $\mathbf{A}$ and $\mathbf{B}$ we can solve for the optimal $\theta$.

It will generally be true that this system of equations is under-determined, as the dimensions of the matrix $\mathbf{A}$ are $n \times (n \times m)$, where $n$ is the number of potentials in the MRF and $m$ is the total number of bases, while the vector $\mathbf{B}$ has length $n$. Therefore, it is useful to build larger $\mathbf{A}$ and $\mathbf{B}$ matrices by stacking in block fashion several copies of equations (33) and (34), computed using different distributions $p$ and $\tilde p$ coming from different initial conditions or other input conditions, as suggested by the operator approximations of equations (2) and (4), and characterized by different values of $\mu$. As long as these copies are linearly independent, this procedure can produce a fully constrained system of equations, as sketched below.
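To make the assembled linear system concrete, the following Python sketch stacks several per-condition blocks and solves for $\theta$ by least squares. The moment arrays here are random placeholders standing in for the Monte Carlo estimates of equations (33) and (34); only the shapes follow the dimension count given above.

import numpy as np

n, m, K = 8, 4, 5                          # n potentials, m basis functions, K conditions
rng = np.random.default_rng(2)

# Placeholder per-condition blocks; in practice these would be the
# Monte Carlo estimates of equations (34) and (33), respectively.
A_blocks = [rng.normal(size=(n, n * m)) for _ in range(K)]
B_blocks = [rng.normal(size=n) for _ in range(K)]

A_stacked = np.vstack(A_blocks)            # shape (K*n, n*m)
B_stacked = np.concatenate(B_blocks)       # shape (K*n,)

# With K*n >= n*m and linearly independent blocks, the stacked system is
# fully constrained; solve for theta in the least-squares sense.
theta, *_ = np.linalg.lstsq(A_stacked, B_stacked, rcond=None)
theta = theta.reshape(n, m)                # one row of basis coefficients per potential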

References

[1] Pepke S, Kinzer-Ursem T, Mihalas S and Kennedy M B 2010 Dynamic model of interactions of Ca2+, calmodulin, and catalytic subunits of Ca2+/calmodulin-dependent protein kinase II PLoS Comput. Biol. 6 e1000675

[2] Mjolsness E and Yosiphon G 2006 Stochastic process semantics for dynamical grammars Ann. Math. Artif. Intell. 47 329–95

[3] Kerr R A, Bartol T M, Kaminsky B, Dittrich M, Chang J C J, Baden S B, Sejnowski T J and Stiles J R 2008 Fast Monte Carlo simulation methods for biological reaction-diffusion systems in solution and on surfaces SIAM J. Sci. Comput. 30 3126

[4] Mjolsness E 2013 Time-ordered product expansions for computational stochastic systems biology Phys. Biol. 10 035009

[5] Hlavacek W S, Faeder J R, Blinov M L, Posner R G, Hucka M and Fontana W 2006 Rules for modeling signal-transduction systems Science's STKE 2006 re6

[6] Danos V and Laneve C 2004 Formal molecular biology Theor. Comput. Sci. 325 69–110

[7] Gillespie D T 1977 Exact stochastic simulation of coupled chemical reactions J. Phys. Chem. 81 2340–61

[8] Kennedy M B, Beale H C, Carlisle H J and Washburn L R 2005 Integration of biochemical signalling in spines Nat. Rev. Neurosci. 6 423–34

[9] Yavneh I 2006 Why multigrid methods are so efficient Comput. Sci. Eng. 8 12–22

[10] Brandt A 1977 Multi-level adaptive solutions to boundary-value problems Math. Comput. 31 333–90

[11] Smyth P 1997 Belief networks, hidden Markov models, and Markov random fields: a unifying view Pattern Recognit. Lett. 18 1261–8

[12] Ackley D H, Hinton G E and Sejnowski T J 1985 A learning algorithm for Boltzmann machines Cogn. Sci. 9 147–69

[13] Johnson T 2012 Dependency diagrams and graph-constrained correlation dynamics: new systems for probabilistic graphical modeling PhD Thesis UC Irvine Computer Science Department. Available at http://ics.uci.edu/~johnsong/thesis/

[14] Gillespie D T 1996 The multivariate Langevin and Fokker–Planck equations Am. J. Phys. 64 1246–57

[15] Lee C H, Kim K-H and Kim P 2009 A moment closure method for stochastic reaction networks J. Chem. Phys. 130 134107


[16] Schnoerr D, Sanguinetti G and Grima R 2014 Validity conditions for moment closure approximations in stochastic chemical kinetics J. Chem. Phys. 141 084103

[17] Gandhi A, Levin S and Orszag S 2000 Moment expansions in spatial ecological models and moment closure through Gaussian approximations Bull. Math. Biol. 62 595–632

[18] Risken H 1989 The Fokker–Planck Equation 2nd edn (Berlin: Springer)

[19] Gillespie D T 2000 The chemical Langevin equation J. Chem. Phys. 113 297

[20] Sotiropoulos V, Contou-Carrere M-N, Daoutidis P and Kaznessis Y N 2009 Model reduction of multiscale chemical Langevin equations: a numerical case study IEEE/ACM Trans. Comput. Biol. Bioinformatics 6 470–82

[21] Singer A 2004 Maximum entropy formulation of the Kirkwood superposition approximation J. Chem. Phys. 121 3657–66

[22] Raghib M, Hill N A and Dieckmann U 2011 A multiscale maximum entropy moment closure for locally regulated space–time point process models of population dynamics J. Math. Biol. 62 605–53

[23] Markham D C, Baker R E and Maini P K 2014 Modelling collective cell behavior Discrete Continuous Dyn. Syst. 34 5123–33

[24] Hespanha J 2008 Moment closure for biochemical networks 3rd Int. Symp. on Communications, Control and Signal Processing (ISCCSP)

[25] van Kampen N 1992 Stochastic Processes in Physics and Chemistry (Amsterdam: North Holland)

[26] Singh A and Hespanha J P 2006 Lognormal moment closures for biochemical reactions Proc. 45th IEEE Conf. on Decision and Control (San Diego, December 13–15) pp 2063–8

[27] Alexander F J, Johnson G, Eyink G L and Kevrekidis I G 2008 Equation-free implementation of statistical moment closures Phys. Rev. E 77 026701

[28] Rao C and Arkin A P 2003 Stochastic chemical kinetics and the quasi-steady-state assumption: application to the Gillespie algorithm J. Chem. Phys. 118 4999–5010

[29] Sinitsyn N A, Hengartner N and Nemenman I 2009 Adiabatic coarse-graining and simulations of stochastic biochemical networks Proc. Natl Acad. Sci. USA 106 10546–51

[30] Kang H W and Kurtz T G 2013 Separation of time-scales and model reduction for stochastic reaction networks Ann. Appl. Probab. 23 529–83

[31] Munsky B and Khammash M 2006 The finite state projection algorithm for the solution of the chemical master equation J. Chem. Phys. 124 044104

[32] Lebiedz D, Skanda D and Fein M 2008 Automatic complexity analysis and model reduction of nonlinear biochemical systems Computational Methods in Systems Biology (Lecture Notes in Computer Science vol 5307) (Berlin: Springer) pp 123–40

[33] Hangos K M, Gabor A and Szederkenyi G 2013 Model reduction in bio-chemical reaction networks with Michaelis–Menten kinetics Proc. 2013 European Control Conf. (ECC) (Zurich, Switzerland)

[34] Rao S, van der Schaft A, van Eunen K, Bakker B M and Jayawardhana B 2014 A model reduction method for biochemical reaction networks BMC Syst. Biol. 8 52

[35] Feret J, Henzinger T, Koeppl H and Petrov T 2012 Lumpability abstractions of rule-based systems Theor. Comput. Sci. 431 137–64

[36] Friedman J, Hastie T and Tibshirani R 2010 Regularization paths for generalized linear models via coordinate descent J. Stat. Softw. 33 1–22

[37] Destexhe A, Mainen Z F and Sejnowski T J 1994 Synthesis of models for excitable membranes, synaptic transmission and neuromodulation using a common kinetic formalism J. Comput. Neurosci. 1 195–230
