biotechnol j 2010 song


    Biotechnology Journal

    DOI 10.1002/biot.201000059 Biotechnol. J. 2010, 5, 768–780

    © 2010 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim

    Research Article

    Ensembles of signal transduction models using Pareto Optimal Ensemble Techniques (POETs)

    Sang Ok Song, Anirikh Chakrabarti and Jeffrey D. Varner

    School of Chemical and Biomolecular Engineering, Cornell University, Ithaca, NY, USA

    Mathematical modeling of complex gene expression programs is an emerging tool for understanding disease mechanisms. However, identification of large models sometimes requires training using qualitative, conflicting or even contradictory data sets. One strategy to address this challenge is to estimate experimentally constrained model ensembles using multiobjective optimization. In this study, we used Pareto Optimal Ensemble Techniques (POETs) to identify a family of proof-of-concept signal transduction models. POETs integrate Simulated Annealing (SA) with Pareto optimality to identify models near the optimal tradeoff surface between competing training objectives. We modeled a prototypical-signaling network using mass-action kinetics within an ordinary differential equation (ODE) framework (64 ODEs in total). The true model was used to generate synthetic immunoblots from which the POET algorithm identified the 117 unknown model parameters. POET generated an ensemble of signaling models, which collectively exhibited population-like behavior. For example, scaled gene expression levels were approximately normally distributed over the ensemble following the addition of extracellular ligand. Also, the ensemble recovered robust and fragile features of the true model, despite significant parameter uncertainty. Taken together, these results suggest that experimentally constrained model ensembles could capture qualitatively important network features without exact parameter information.

    Keywords: Mathematical modeling · Robustness and fragility · Systems biology

    Correspondence: Professor Jeffrey D. Varner, School of Chemical and Biomolecular Engineering, 244 Olin Hall, Cornell University, Ithaca, NY 14853, USA

    E-mail: [email protected]

    Fax: +1-607-255-9166

    Abbreviations: ODE, ordinary differential equation; POET, Pareto Optimal Ensemble Technique; SA, Simulated Annealing

    Received 11 May 2010

    Revised 14 June 2010

    Accepted 21 June 2010

    Supporting information available online

    1 Introduction

    Mathematical modeling of signal transduction and gene expression programs is an emerging tool for understanding disease mechanisms. Kitano [1] suggested that analysis of molecular networks using predictive computer models will play an increasingly important role in biomedical research. However, conventional wisdom suggests that the data requirement to identify and validate complex mechanistic models is too large. Molecular network models often exhibit complex behavior [2]. Typically, it is not possible to uniquely identify model parameters, even with extensive training data and perfect models [3]. Thus, despite identification standards [4] and the integration of model identification with experimental design [5], parameter estimation remains challenging even with structurally complete models. This reality has brought into the foreground a number of interesting questions. For example, do we actually need exact parameter knowledge to predict qualitatively important properties of a molecular network? Or can we estimate which components and connections are central to network function given only limited parameter information?

    Two schools of thought have emerged on how uncertain models can be used to understand molecular network function. Bailey hypothesized that qualitative properties of metabolic or signaling networks could be determined using network structure without parameter knowledge [6]. Certainly, there is literature evidence supporting the Bailey hypothesis in metabolic networks [7]. Studies exploring network modularity [8] have also identified recurrent motifs that betray natural design principles. Alternatively, ensemble approaches, which use uncertain model families, have also emerged to deal with uncertainty in systems biology and other fields like weather prediction [9–13]. Their central value has been the ability to quantify simulation uncertainty and to constrain model predictions. For example, Gutenkunst et al. [14] showed that predictions were possible using ensembles of signal transduction models despite sometimes only order of magnitude parameter estimates. Beyond their ability to robustly describe data, uncertain deterministic ensembles might be a coarse-grained strategy to explore population dynamics when stochastic simulation is too expensive. There are several techniques to generate parameter ensembles. Battogtokh et al. [10] and later Brown et al. [12] generated experimentally constrained parameter ensembles using a Metropolis-type random walk through parameter space. Moles et al. [15] contrasted evolutionary and deterministic optimization techniques, any one of which could be adapted for ensemble generation. However, the unifying component of these previous identification strategies has been the minimization of a single objective function.

    In this study, we used Pareto Optimal Ensemble Techniques (POETs) to identify a family of proof-of-concept signal transduction models. Our objectives were to test a modification to the original POET algorithm published by Song et al. [9] and to more deeply explore the properties of model ensembles. The motivation for POETs is practical. The identification of models with hundreds, thousands or even tens of thousands of parameters requires that we use measurements from multiple laboratories or even different cell lines. These training data can contain conflicts or can sometimes even be contradictory. Thus, a central challenge when identifying large models is the ability to balance conflicts in diverse training data. POETs, which integrate Simulated Annealing (SA) and multiobjective optimization through the notion of Pareto rank, find solutions that optimally balance these trade-offs. The modified POETs strategy described here improved the performance of the original algorithm using a local parameter refinement step. Interestingly, the model ensemble generated using POET exhibited coarse-grained heterogeneity, suggesting that deterministic ensembles could perhaps be used to model heterogeneous populations. A secondary challenge was the subsequent characterization of network features in a family of models, using sensitivity analysis. Sensitivity analysis has enabled the investigation of robustness and fragility in molecular networks (see [9, 16–19]). Sensitivity analysis has also been crucial to model identification, discrimination and experimental design [3, 20–23]. However, sensitivity analysis, using first-order sensitivity coefficients, is a function of the model parameters. Thus, another open question explored here was whether qualitative properties estimated by sensitivity analysis were recovered by the ensemble. We demonstrate that model ensembles recovered highly robust and fragile features of the true model, despite significant parameter uncertainty.

    2 Materials and methods

    2.1 Formulation, solution and analysis of the model equations

    We identified a family of models describing a growth factor-induced three-gene transcriptional program (Fig. 1). The model is available in SBML format in the supplemental materials. The model was formulated as a set of coupled ordinary differential equations (ODEs):

    $$\frac{d\mathbf{x}}{dt} = \mathbf{S}\cdot\mathbf{r}(\mathbf{x},\mathbf{k}), \qquad \mathbf{x}(t_0) = \mathbf{x}_0, \qquad \mathbf{y}(t) = \mathbf{Y}\,\mathbf{x}(t) \tag{1}$$

    where $\mathbf{x}$ denotes the species concentration vector (64 × 1), $\mathbf{k}$ denotes the parameter vector (117 × 1) and $\mathbf{r}(\mathbf{x},\mathbf{k})$ denotes the vector of reaction rates (117 × 1). The symbol $\mathbf{S}$ denotes the stoichiometric matrix (64 × 117). The (i,j) element of $\mathbf{S}$, denoted by $\sigma_{ij}$, described the relationship between protein i and rate j. If $\sigma_{ij} < 0$, protein i was consumed by $r_j$. If $\sigma_{ij} > 0$, protein i was produced by $r_j$. Lastly, if $\sigma_{ij} = 0$, protein i was not involved in rate j. The symbol $\mathbf{y}$ denotes the model output vector, where $\mathbf{Y}$ denotes the measurement selection matrix.

    We assumed mass-action kinetics for each interaction in the network. The rate expression for reaction q was given by:

    $$r_q(\mathbf{x}, k_q) = k_q \prod_{j \in \{\mathbf{R}_q\}} x_j^{-\sigma_{jq}} \tag{2}$$

    The quantity $\{\mathbf{R}_q\}$ denotes the set of reactants for reaction q, while $k_q$ denotes the rate constant governing reaction q. The symbols $\sigma_{jq}$ denote the stoichiometric coefficients (elements of $\mathbf{S}$) for the reactants involved with reaction q. All reversible interactions were split into two irreversible steps; thus, every interaction in the model was non-negative. Inactive or infrastructure proteins and macromolecules (R1, A1, A2, iTF, iK, EXPORT, IMPORT, PH and PH-TF), RNAP and ribosomes were assumed to have zero-order production rates and first-order degradation rates. These rate constants were estimated along with the binding and catalytic model parameters. All initial conditions were zero except gene 1, 2, and 3 (1 if present, 0 if absent). We accounted for membrane, cytosolic and nuclear proteins and mRNA by explicitly defining separate species in each of these compartments.
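
The mass balances and the mass-action rate law described above can be illustrated with a minimal Python sketch. It assumes the usual convention that reactants carry negative stoichiometric coefficients, and it uses a hypothetical two-reaction binding network rather than the 64-species model; the names `S`, `reaction_rates` and `dxdt` and all numerical values are illustrative assumptions.

```python
from math import prod

# Hypothetical toy network (not the model from the paper):
#   r0: L + R -> LR   (association)
#   r1: LR -> L + R   (dissociation)
# Rows are species (L, R, LR); columns are reactions (r0, r1).
S = [
    [-1, 1],   # L
    [-1, 1],   # R
    [1, -1],   # LR
]


def reaction_rates(x, k, S):
    """Mass-action rates: r_q = k_q * product over reactants j of x_j ** (-sigma_jq)."""
    rates = []
    for q, k_q in enumerate(k):
        # Reactants are the species with a negative coefficient in column q.
        reactants = [j for j in range(len(x)) if S[j][q] < 0]
        rates.append(k_q * prod(x[j] ** (-S[j][q]) for j in reactants))
    return rates


def dxdt(x, k, S):
    """Right-hand side of the mass balances: dx/dt = S . r(x, k)."""
    r = reaction_rates(x, k, S)
    return [sum(S[i][q] * r[q] for q in range(len(r))) for i in range(len(x))]


x0 = [2.0, 3.0, 0.5]   # illustrative concentrations of L, R, LR
k = [0.1, 0.2]         # illustrative rate constants
print(reaction_rates(x0, k, S))
print(dxdt(x0, k, S))
```

Because each reversible interaction is written as two irreversible columns of `S`, every rate stays non-negative and the net direction is carried only by the signs in the stoichiometric matrix.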

    Mass-action kinetics, while expanding the dimension of the model, regularized its mathematical structure. This allowed automatic generation of the model code using the UNIVERSAL code generation tool. UNIVERSAL, an open source Java code generator, supports the generation of model code from text and SBML files. UNIVERSAL currently supports multiple code types (Matlab/Octave-M, Octave-C, Sundials-C, GSL-C and Scilab) and it is extensible with a simple plugin API. UNIVERSAL is freely available as a Google Code project. Model code was generated as a C++ Octave module and solved using the LSODE routine of Octave (www.octave.org). When calculating the response of the model to ligand, we ran the model to steady-state and then simulated the addition of ligand. The steady-state was estimated numerically by repeatedly solving the model equations and estimating the difference between subsequent time points:

    $$\lVert \mathbf{x}(t + \Delta t) - \mathbf{x}(t) \rVert_2 \le \gamma \tag{3}$$

    The quantities $\mathbf{x}(t)$ and $\mathbf{x}(t + \Delta t)$ denote the simulated concentration vector at time $t$ and $t + \Delta t$, respectively. The L2 vector-norm was used as the distance metric. We used $\Delta t$ = 100 s and $\gamma$ = 0.01 for all simulations.
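
The steady-state test described above can be sketched as a loop that integrates one window at a time and stops once successive states differ by less than the tolerance. This is only an illustration: a fixed-step Runge-Kutta integrator stands in for LSODE, and the two-species production/degradation system, the function names and the sub-step count are assumptions made for the example. Only the 100 s window and the 0.01 tolerance come from the text.

```python
import math


def rk4_step(f, x, h):
    """One classical fourth-order Runge-Kutta step for dx/dt = f(x)."""
    k1 = f(x)
    k2 = f([xi + 0.5 * h * ki for xi, ki in zip(x, k1)])
    k3 = f([xi + 0.5 * h * ki for xi, ki in zip(x, k2)])
    k4 = f([xi + h * ki for xi, ki in zip(x, k3)])
    return [xi + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]


def advance(f, x, delta_t, n_sub=100):
    """Integrate over one window of length delta_t using n_sub RK4 steps."""
    h = delta_t / n_sub
    for _ in range(n_sub):
        x = rk4_step(f, x, h)
    return x


def run_to_steady_state(f, x0, delta_t=100.0, tol=0.01, max_windows=1000):
    """Repeat windows until the L2 distance between successive states is <= tol."""
    x = list(x0)
    for window in range(1, max_windows + 1):
        x_next = advance(f, x, delta_t)
        if math.dist(x_next, x) <= tol:
            return x_next, window
        x = x_next
    raise RuntimeError("steady state not reached within max_windows")


# Hypothetical system: zero-order production, first-order degradation.
production = [1.0, 0.5]
degradation = [0.01, 0.02]


def toy_rhs(x):
    return [p - d * xi for p, d, xi in zip(production, degradation, x)]


x_ss, n_windows = run_to_steady_state(toy_rhs, [0.0, 0.0])
print(x_ss, n_windows)
```

For this toy system the analytical steady state is production divided by degradation (100 and 25), which gives a simple check that the stopping rule ends close to the true fixed point.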

    Figure 1. Schematic of the prototypical signaling network used in this study. Extracellular ligand L binds surface receptor R1 driving the phosphorylation of transcription factor TF. TF up-regulates gene 1 expression. Gene 1 then initiates a cascade resulting in the expression of gene 2 and gene 3. Gene 3 down-regulates the expression of gene 1. The model is available in SBML format in the supplemental materials.

    Sensitivity analysis was used to estimate which network components were fragile or robust. First-order sensitivity coefficients at time $t_q$:

    $$s_{ij}(t_q) = \left.\frac{\partial x_i}{\partial k_j}\right|_{t_q} \tag{4}$$

    were computed by solving the kinetic-sensitivity equations [24]:

    $$\begin{pmatrix} d\mathbf{x}/dt \\ d\mathbf{s}_j/dt \end{pmatrix} = \begin{pmatrix} \mathbf{S}\cdot\mathbf{r}(\mathbf{x},\mathbf{k}) \\ \mathbf{A}(t)\,\mathbf{s}_j + \mathbf{b}_j(t) \end{pmatrix}, \qquad j = 1, 2, \ldots, P \tag{5}$$

    subject to the initial condition $\mathbf{s}_j(t_0) = 0$. The quantity j denotes the parameter index, P denotes the number of parameters in the model, $\mathbf{A}$ denotes the Jacobian matrix, and $\mathbf{b}_j$ denotes the jth column of the matrix of first derivatives of the mass balances with respect to the parameters. Sensitivity coefficients were calculated by repeatedly solving the extended kinetic-sensitivity system for each parameter using the LSODE routine of Octave (www.octave.org) over a sparse sampling (approximately 10%) of the ensemble (see Fig. 3). The Jacobian $\mathbf{A}$ and the $\mathbf{b}_j$ vector were calculated at each time step using their analytical expressions generated by UNIVERSAL. The resulting sensitivity coefficients were then scaled and time-averaged (Trapezoid rule):

    $$N_{ij} \equiv \frac{1}{T}\int_0^T dt\; \alpha_{ij}(t)\, s_{ij}(t) \tag{6}$$

    where T denotes the final simulation time and $\alpha_{ij} = 1$ (unscaled) or $\alpha_{ij}(t) = k_j/x_i(t)$ (scaled). The scaled time-averaged sensitivity coefficients were then organized into an array for each ensemble member:

    $$\mathbf{N}(\epsilon) = \begin{pmatrix} N_{11}^{(\epsilon)} & N_{12}^{(\epsilon)} & \cdots & N_{1j}^{(\epsilon)} & \cdots & N_{1P}^{(\epsilon)} \\ N_{21}^{(\epsilon)} & N_{22}^{(\epsilon)} & \cdots & N_{2j}^{(\epsilon)} & \cdots & N_{2P}^{(\epsilon)} \\ \vdots & \vdots & & \vdots & & \vdots \\ N_{M1}^{(\epsilon)} & N_{M2}^{(\epsilon)} & \cdots & N_{Mj}^{(\epsilon)} & \cdots & N_{MP}^{(\epsilon)} \end{pmatrix}, \qquad \epsilon = 1, 2, \ldots, N_\epsilon \tag{7}$$

    where $\epsilon$ denotes the index of the ensemble member, P denotes the number of parameters, $N_\epsilon$ denotes the number of ensemble samples and M denotes the number of model species. The $\mathbf{B}_i$ matrix contained the time-averaged sensitivities for a single species for each parameter (rows) as a function of the ensemble (columns):

    $$\mathbf{B}_i = \begin{pmatrix} N_{i1}^{(1)} & N_{i1}^{(2)} & \cdots & N_{i1}^{(N_\epsilon)} \\ N_{i2}^{(1)} & N_{i2}^{(2)} & \cdots & N_{i2}^{(N_\epsilon)} \\ \vdots & \vdots & & \vdots \\ N_{iP}^{(1)} & N_{iP}^{(2)} & \cdots & N_{iP}^{(N_\epsilon)} \end{pmatrix}, \qquad i = 1, 2, \ldots, M \tag{8}$$

    To estimate the relative fragility or robustness of species and reactions in the network, we decomposed the $\mathbf{N}(\epsilon)$ or the $\mathbf{B}_i$ matrices using Singular Value Decomposition (SVD):

    $$\mathbf{N}(\epsilon) = \mathbf{U}(\epsilon)\,\boldsymbol{\Sigma}(\epsilon)\,\mathbf{V}^T(\epsilon), \qquad \mathbf{B}_i = \mathbf{U}_i\,\boldsymbol{\Sigma}_i\,\mathbf{V}_i^T \tag{9}$$

    Coefficients of the left (right) singular vectors corresponding to largest singular values of $\mathbf{N}(\epsilon)$ were rank-ordered to estimate important species (reaction) combinations. Only coefficients with magnitude greater than a threshold (0.1) were considered. The fraction of the vectors in which a reaction or species index occurred was used to rank its importance. Similarly, the left singular vectors of $\mathbf{B}_i$ showed which reaction combinations were important for species i, while the right singular vectors rank-ordered which ensemble members contributed most significantly to the sensitivity of species i.
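
The ranking step described above can be illustrated on a small array of time-averaged sensitivities. The sketch below is a simplification under stated assumptions: it extracts only the leading singular vectors, uses power iteration in place of a full SVD so that it needs no external library, and operates on a hypothetical 3-species by 4-parameter array. The function names and numbers are illustrative and not taken from the paper.

```python
import math


def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def leading_singular_vectors(A, n_iter=500):
    """Power iteration on A^T A; returns (sigma, u, v) for the largest singular value."""
    A_t = transpose(A)
    v = normalize([1.0] * len(A[0]))
    for _ in range(n_iter):
        v = normalize(matvec(A_t, matvec(A, v)))
    Av = matvec(A, v)
    sigma = math.sqrt(sum(c * c for c in Av))
    u = [c / sigma for c in Av]
    return sigma, u, v


def rank_by_magnitude(vec, threshold=0.1):
    """Indices whose absolute coefficient exceeds the threshold, largest first."""
    kept = [(abs(c), idx) for idx, c in enumerate(vec) if abs(c) > threshold]
    return [idx for _, idx in sorted(kept, reverse=True)]


# Hypothetical sensitivity array: rows are species, columns are parameters.
N = [
    [4.0, 1.0, 0.0, 0.0],
    [1.0, 3.0, 0.1, 0.0],
    [0.0, 0.1, 0.05, 0.02],
]

sigma, u, v = leading_singular_vectors(N)
species_rank = rank_by_magnitude(u)      # left vector ranks species (nodes)
parameter_rank = rank_by_magnitude(v)    # right vector ranks parameters (edges)
print(species_rank, parameter_rank)
```

In practice a library SVD routine would return all singular vectors at once, and the importance score would be the fraction of retained vectors in which each index survives the threshold.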

    2.2 POETs

    POETs integrate SA with Pareto optimality to estimate parameter sets on or near the optimal trade-off surface between competing training objectives (Fig. S1). Here, we modified the original algorithm [9] to improve its convergence properties. Denote a candidate parameter set at iteration i + 1 as $\mathbf{k}_{i+1}$. The squared error for $\mathbf{k}_{i+1}$ for training set j was defined as:

    (10)

    The symbol $\hat{M}_{ij}$ denotes scaled experimental observations (from training set j), while the symbol $\hat{y}_{ij}$ denotes the scaled simulation output (from training set j). The quantity i denotes the sampled time index and $T_j$ denotes the number of time points for experiment j. We assumed only immunoblots were available for training with the exception of a single qRT-PCR or ELISA measurement of the highest intensity band. The first term in the objective function quantified the relative simulation error. The read-out from the training immunoblots was band intensity where we assumed intensity was only loosely proportional to concentration. Suppose we have the intensity for species x at time i = {t1, t2, ..., tn} in condition j. The scaled-value measurement would then be given by:

    $$\hat{M}_{ij} = \frac{M_{ij} - \min_i M_{ij}}{\max_i M_{ij} - \min_i M_{ij}} \tag{11}$$

    Under this scaling, the lowest intensity band equaled zero, while the highest intensity band equaled one. A similar scaling was defined for the simulation output. The second term in the objective function quantified the error in the estimated concentration scale. We assumed only the highest intensity bands were quantified absolutely (denoted by $M_{ij}$) and compared with the simulation. However, if these measurements were not available, the second term could be adjusted to ensure the model operated on physiologically relevant concentration scales.
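
The band-intensity scaling described above maps the weakest band to zero and the strongest to one. A minimal sketch follows, assuming a plain min-max scaling applied separately to the measured and the simulated time series. Only the first (relative) error term is sketched, and the function names and numbers are hypothetical.

```python
def scale_min_max(values):
    """Scale a time series so its minimum maps to 0 and its maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant series")
    return [(v - lo) / (hi - lo) for v in values]


def scaled_squared_error(measured, simulated):
    """Sum of squared differences between scaled measurement and scaled simulation."""
    m_hat = scale_min_max(measured)
    y_hat = scale_min_max(simulated)
    return sum((m - y) ** 2 for m, y in zip(m_hat, y_hat))


band_intensity = [12.0, 30.0, 48.0, 21.0]   # hypothetical densitometry read-out
simulated_conc = [0.1, 0.6, 1.1, 0.35]      # hypothetical model output

print(scale_min_max(band_intensity))
print(scaled_squared_error(band_intensity, simulated_conc))
```

Because both series are scaled independently, a simulation that reproduces the shape of the blot gives a near-zero error even when its absolute concentrations differ, which is why a separate term is needed to pin down the concentration scale.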

    We computed the Pareto rank of $\mathbf{k}_{i+1}$ by comparing the simulation error at iteration i + 1 against the simulation archive $\mathbf{K}_i$. We used the Fonseca and Fleming ranking scheme [25]:

    $$\mathrm{rank}(\mathbf{k}_{i+1} \mid \mathbf{K}_i) = p \tag{12}$$

    where p denotes the number of parameter sets that dominate parameter set $\mathbf{k}_{i+1}$. Parameter sets on or near the optimal trade-off surface have small rank(

    was discretized into 10 quanta between $T_0$ and $T_f$ and adjusted according to the schedule $T_k = \beta^k T_0$ where $\beta$ was defined as:

    $$\beta = \left(\frac{T_f}{T_0}\right)^{1/10} \tag{14}$$

    The epoch-counter k was incremented after the addition of 50 members to the ensemble. Thus, as the ensemble grew, the likelihood of accepting parameter sets with a large Pareto rank decreased. To generate parameter diversity, we randomly perturbed each parameter by 50%. However, in addition to a random-walk strategy (previous algorithm), we performed a local pattern search every q steps to minimize the residual for a single randomly selected objective. The local pattern-search algorithm has been described previously [26, 27]. The parameter ensemble used in the simulation and sensitivity studies was generated from the low-rank parameter sets in $\mathbf{K}_i$.
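
Two ingredients of the procedure above lend themselves to a short illustration: the Pareto rank as a count of dominating archive members, and the geometric temperature schedule over 10 quanta. The sketch below assumes that every objective is an error to be minimized; the archive of two-objective error vectors and the initial and final temperatures are hypothetical, as are the function names.

```python
def dominates(a, b):
    """True if error vector a is no worse than b everywhere and strictly better somewhere."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_rank(candidate, archive):
    """Fonseca-Fleming style rank: number of archive members that dominate the candidate."""
    return sum(1 for member in archive if dominates(member, candidate))


def cooling_factor(t_initial, t_final, n_quanta=10):
    """Factor that takes t_initial to t_final in n_quanta geometric steps."""
    return (t_final / t_initial) ** (1.0 / n_quanta)


def temperature(k, t_initial, t_final, n_quanta=10):
    """Temperature at epoch k under the geometric schedule."""
    return cooling_factor(t_initial, t_final, n_quanta) ** k * t_initial


# Hypothetical archive of error vectors for two training objectives.
archive = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0), (3.0, 3.0)]

print([pareto_rank(e, archive) for e in archive])
print(pareto_rank((2.5, 2.5), archive))
print(temperature(0, 1.0, 0.1), temperature(10, 1.0, 0.1))
```

Members with rank zero lie on the current trade-off surface of the archive; the point (3.0, 3.0) is dominated by (2.0, 2.0) and therefore receives rank one.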

    3 Results

    3.1 Summary

    We identified and analyzed a family of canonical signal transduction models using POETs and sensitivity analysis. POET has previously been used to identify molecular models of pain signaling [9]. We modified the original algorithm by integrating a local pattern-search routine, which better controlled the absolute error in the ensemble identification. The original and modified algorithms were used to estimate an ensemble of signaling models. The model, which was assumed to have a known network structure, described the integration of extracellular signals with kinase activation, the phosphorylation of transcription factors, and the up-regulation of an associated transcriptional program (Fig. 1). Thus, while not specific to a particular growth factor, signaling cascade or expression program, it contained many of the general features encountered when identifying specific models. We modeled the molecular interactions in the prototypical-signaling network using mass-action kinetics within an ODE framework. ODEs and mass-action kinetics are common methods of modeling biological pathways [9, 16–18, 28–32]. We assumed spatial homogeneity but differentiated between cytosolic, membrane and nuclear localized processes. The true model (known parameters) was used to generate synthetic data from which we tested the POET algorithm. Each synthetic measurement was assumed to be a Northern or Western blot. Thus, we knew only relative amounts of protein or mRNA for any specific condition or time. To constrain the absolute concentration scale, we assumed a single ELISA or qRT-PCR measurement for the highest intensity band in each case. Lastly, we limited our training data to 20 samples per experiment (an upper limit on the lanes available on a Western blot).

    The modified POET algorithm performed better than the original implementation and generated an ensemble that collectively exhibited population-like behavior. First, the ODE model used here was deterministic and did not describe stochastic gene expression fluctuations. However, because many different parameter sets were sampled, the deterministic ensemble exhibited population-like behavior. For example, scaled gene expression levels were approximately normally distributed following the addition of extracellular ligand. Thus, while gene expression was not described at a single-cell level, the ensemble captured coarse-grained expression heterogeneity. This suggested that deterministic ensembles could perhaps be used to model heterogeneous populations. Second, the model ensemble captured the robust and fragile features of the true model, despite significant parameter uncertainty. Edge (interactions between species) and node (species) ranks computed over the ensemble using sensitivity analysis were consistent with the true rankings, at least for highly fragile and robust network components. This suggested that, in practice, results from sensitivity analysis obtained by analyzing model ensembles could represent true behavior to a high degree of certainty, at least for highly fragile or robust network features. The true model is available in SBML format in the supplemental materials.

    3.2 Estimating an ensemble of models using multiobjective optimization

    We estimated an ensemble of signal transduction models from synthetic data sets using POET (Fig. S1). The canonical model had 117 unknown kinetic constants, primarily of three types (association, dissociation or catalytic rate constants). Because we used mass-action kinetics, every network interaction was governed by a single parameter. Using the true model, we generated 24 synthetic data sets using a (3,2,2,2)-level factorial design. The design variables considered were the level of ligand stimulation (L = 0, L = 10 and L = 50) and the presence and absence of gene 1, 2 and 3. In each data set, we assumed inactivated/activated kinase (cytosol), inactivated/activated transcription factor (cytosol), mRNA for protein 1 (cytosol) and the cytosolic level of protein 1 were measured at 20 points equidistant over the time-course of the experiment (approximately 3 h). Each synthetic dataset became an objective in the optimization calculation from which we estimated the model ensemble (24 objectives in total).
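
The (3,2,2,2)-level factorial design can be enumerated directly, which also confirms the count of 24 data sets. The sketch below assumes the level coding given in the caption of Fig. 2 (ligand levels 1, 2, 3 map to 0, 10, 50; gene levels 1, 2 map to deleted, present). The dictionary keys and the helper `decode_levels` are illustrative names, not part of the original work.

```python
from itertools import product

LIGAND_LEVELS = {1: 0, 2: 10, 3: 50}   # design level -> ligand concentration
GENE_LEVELS = {1: 0, 2: 1}             # design level -> 0 (deleted) or 1 (present)


def decode_levels(levels):
    """Translate a (ligand, gene1, gene2, gene3) level tuple into simulation settings."""
    ligand, g1, g2, g3 = levels
    return {
        "ligand": LIGAND_LEVELS[ligand],
        "gene1": GENE_LEVELS[g1],
        "gene2": GENE_LEVELS[g2],
        "gene3": GENE_LEVELS[g3],
    }


# Full factorial: 3 ligand levels x 2 x 2 x 2 gene states.
design = [decode_levels(levels)
          for levels in product((1, 2, 3), (1, 2), (1, 2), (1, 2))]

print(len(design))
print(decode_levels((3, 2, 2, 2)))
```

Each entry of `design` corresponds to one synthetic data set, and therefore to one objective in the multiobjective search.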

    The POET algorithm with local parameter refinement performed better than the original implementation (Fig. 2). Both implementations started from the same randomized parameter seed, used the same software libraries and were run over a 72-h period on the same hardware. Both implementations used a maximum acceptable Pareto rank of three or less. The modified algorithm generated 2882 ranked sets, of which 1062 had a Pareto rank equal to zero (Fig. 2, black circles). On the other hand, the original POET implementation generated 20645 ranked sets, of which 1538 had a Pareto rank equal to zero (Fig. 2, grey circles). While local refinement required additional function evaluations, the median training residuals were lower than those of the original implementation (Fig. S2). The quality of the resulting ensemble generated with local refinement was also higher. Approximately 47% of the model parameters (55 of 117) were constrained with a coefficient of variation (CV) of less than or equal to one (Fig. 3A). In comparison, the minimum CV produced by the original implementation was 1.7 (Fig. 3B). The top five constrained parameters were protein 1 (cytosol), RNAP and EXPORT degradation (all 0.64), the degradation of mRNA for gene 3 (0.65; negative regulator of P1 expression) and the constitutive expression of gene 1 (0.67). The top five least-constrained parameters were associated with kinase regulation or regulated gene 1 expression (CV > 2). Well-constrained parameters were pseudo-normally distributed with a strong positive skew, while parameters with a high CV were approximately exponentially distributed (Fig. S3). Analysis of the residuals produced by POET gave insight into relationships in the training data (Fig. 2). For example, O6 O2 and similar-

    Figure 3. Coefficient of variation (CV) of model parameters, plotted against the sorted parameter index, estimated using POET with local parameter refinement (A) and the original implementation (B). The solid line denotes the mean CV calculated over the entire ensemble, while the points denote the CV of the ensemble sample used in the sensitivity analysis calculations. Approximately 47% or 55 of 117 parameters had CV ≤ 1 for POET with local refinement. The minimum CV obtained using the original POET implementation was 1.7.

    Figure 2. Objective function array for parameter sets with Pareto rank = 0 for the original POET implementation (gray circles) and POET with local parameter refinement (black circles). Eight objectives are shown from the 24 objectives used in the model identification. The symbol Oj indicates the jth objective function. Points indicate the error associated with ensemble parameter sets. Objectives were defined using a (3,2,2,2)-level factorial design (ligand, gene1, gene2, gene3): O1 = (2,2,1,1), O2 = (2,2,1,2), O3 = (2,2,2,1), O4 = (2,2,2,2), O5 = (3,2,1,1), O6 = (3,2,1,2), O7 = (3,2,2,1) and O8 = (3,2,2,2). Design levels: ligand (1,2,3) = (0,10,50) and gene j (1,2) = (deleted, present).
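
The coefficient of variation used above to judge how tightly each parameter was constrained can be computed column by column over an ensemble. The sketch below is illustrative only: it assumes the sample standard deviation (the text does not say which estimator was used), and the four-member, three-parameter ensemble is hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean."""
    return stdev(samples) / mean(samples)


def constrained_parameters(ensemble, cutoff=1.0):
    """Return per-parameter CVs and the indices of parameters with CV <= cutoff."""
    columns = list(zip(*ensemble))   # one tuple of sampled values per parameter
    cvs = [coefficient_of_variation(col) for col in columns]
    constrained = [j for j, cv in enumerate(cvs) if cv <= cutoff]
    return cvs, constrained


# Hypothetical ensemble: rows are ensemble members, columns are parameters.
ensemble = [
    [1.0, 0.1, 5.0],
    [1.2, 10.0, 5.5],
    [0.8, 0.5, 4.5],
    [1.0, 0.2, 5.0],
]

cvs, constrained = constrained_parameters(ensemble)
print(cvs)
print(constrained, len(constrained) / len(cvs))
```

In this toy ensemble the second parameter spans two orders of magnitude and exceeds the cutoff, while the other two parameters count as constrained.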


    parameter ensemble for two key catalytic reactions in our network, namely, the activation of kinase by activated receptor and the phosphorylation of transcription factor by activated kinase to determine if the Michaelis–Menten assumption was valid. We considered parameter sets from the locally refined parameter ensemble with Pareto rank ≤ 3. For these reactions, the on- and catalytic rate constants had a CV ≤ 1, while the off-rates were not well constrained (CV > 2). On average, the Michaelis–Menten assumption was violated by 35% of the ensemble, suggesting that we could possibly reduce model complexity by changing the kinetics. However, mass-action kinetics have the advantages of regularized mathematical structure and simplicity, which offsets the added complexity.

    3.3 Rank-based assessment of nodes and edges was conserved by the ensemble

    A key question when using model ensembles is whether the rank-based assessment of critical network components is correct, given significant parametric diversity. Previously, we approached this question by comparing the nodes or edges predicted to be important in a variety of models with literature [9, 17, 19, 33]. However, these comparisons were imperfect. Many factors were likely different between the experimental and modeling studies. Moreover, these comparisons were only as reliable as the underlying literature search, which was not exhaustive. In this study, we validated the classification of nodes and edges as fragile or robust by comparing the true model with models from the ensemble.

    Local processes such as transcription factorregulation and global infrastructure like RNAP, nu-clear transport and translation were the most frag-ile components of the prototypical-signaling net-work.First-order sensitivity coefficients were com-puted for the true parameters and the ensemble.These coefficients were then time-averaged toform the Nand Barrays (see Materials and meth-ods). The magnitude of the coefficients of the left

    0

    1

    2

    3

    4

    5

    6

    7

    0 0.5 1 1.5 2 2.5 3

    0

    2

    4

    6

    8

    10

    12

    0 0.5 1 1.5 2 2.5 30

    1

    2

    3

    4

    5

    6

    7

    0 0.5 1 1.5 2 2.5 3

    0

    0.5

    1

    1.5

    2

    2.5

    3

    3.5

    4

    4.5

    0 0.5 1 1.5 2 2.5 3

    Time

    Time Time

    Time

    Protein1cytosol

    (p1C)(A.U)

    P

    rotein2cytosol

    (p2C)(A.U)

    Protein3cytosol

    (p3C)(A.U)

    mRNAProtein2

    Cytosol(A.U)

    (A) (B)

    (C) (D)

    O7

    O8

    O8

    O8

    Figure 5. Model predictions following the addition of ligand L versus modified synthetic data. The dashed lines denote the mean simulated value over the

    ensemble; the gray region denotes the 95% confidence interval. The points denote the mean synthetic data used to validate the model. The validation data

    was generated from the training data by adding a background level of the ligand (L =1) and by considering species not used for training (with the exception

    of protein 1). (A) Cytosolic levels of protein 1 versus time. Points denote the O7 = (3,2,2,1) data set in the presence of background ligand (L =1). (B) Cytoso-

    lic levels of protein 3 versus time. Points denote the O8 = (3,2,2,2) data set in the presence of background ligand (L =1). (C) Cytosolic levels of protein 2 ver-

    sus time. Points denote the O8 = (3,2,2,2) data set in the presence of background ligand (L =1). (D) Cytosolic levels of mRNA for protein 2 versus time.

    Points denote the O8 = (3,2,2,2) data set in the presence of background ligand (L =1).

  • 8/9/2019 Biotechnol J 2010 Song

    9/13

    BiotechnologyJournal

    Biotechnol. J. 2010, 5, 768780

    776 2010 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim

    (right) singular vectors corresponding to largest singular values ofNwere used to rank-order theimportance of the nodes (edges) in the model(Fig.7).The most sensitive node combinations with=1 involved the regulation of activated transcrip-tion factor (aTF) and the transport of aTF into thenucleus (Fig. 7, top). Similarly, the most sensitiveedges involved PH-TF regulation of aTF, the pro-duction, degradation and regulation of the specifickinase for TF (iK/aK), the production and degrada-

    tion of iTF and the production/degradation of PH-TF. Analysis of additional singular vectors (in-creased ) highlighted the role of global infrastruc-ture like RNAP, nuclear transport (IMPORT/EX-PORT) and translation (Fig. 7, middle and bottom).Analysis of the left singular vectors of the BaTFma-trix also supported these findings. On the otherhand, the most robust species and reaction combi-nations involved the assembly of the adaptor com-plex and the basal expression of gene 1, 2 and 3.Subpopulations in the ensemble behaved differ-ently. Analysis of the right singular vectors ofBaTF

    suggested which ensemble elements most influ-enced a particular species. For example, examina-tion of the top and bottom three ranked ensemblemembers,estimated from the right singular vectorsof BaTF , showed the highest ranked ensemblemembers had similar aTF trajectories (Fig. S5, sol-id-lines). Conversely, the lowest three had widely varying aTF levels (Fig. S5, dashed-lines). Thus,subpopulations with qualitatively distinct behaviorwere present in the ensemble and decomposing theB

    array could identify these elements.Edge and node ranks computed over the en-semble recovered the true rankings for highly frag-ile and highly robust network components (Fig. 8).We compared the node (species) and edge (inter-action) ranks computed using sensitivity analysisfor the true parameter set with the ensemble (=1).The Kendall and Spearman rank correlations wereused to quantify the agreement between the trueand estimated ranked lists (Table 1).The Spearmanand Kendall correlation coefficients were approxi-mately normally distributed for both node and edge

    [Figure 6: eight histogram panels at t = 0.10, 0.20, 0.25, 0.30, 0.40, 0.50, 0.75 and 1.0 hr. x-axis: scaled protein concentration (A.U.); y-axis: number of cells.]

    Figure 6. Distributions for the scaled cytosolic protein 1 concentration as a function of time following the addition of extracellular ligand L. Bars denote expression bins for protein 1 (expression levels were sub-divided into 10 bins). The solid line denotes a normal distribution fit to the histogram (histfit function of Octave). Initially, the ensemble was synchronized with low scaled protein 1 expression (upper left-hand plot). After the addition of the ligand L, the distribution of cells expressing protein 1 shifted to the right (progressing through an approximately normal distribution during active expression of protein 1). After t = 1.0 h, the bulk of the cells reached their maximum cytosolic levels of protein 1 (lower right-hand corner).
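The binning and normal fit described in the Figure 6 caption can be illustrated with a minimal Python sketch. This is a stand-in for the Octave histfit call, not the authors' script: the sample values, the helper names `histogram` and `normal_fit_counts`, and the use of the sample mean and standard deviation for the fit are all assumptions made for illustration.

```python
import random
from statistics import NormalDist, mean, stdev

def histogram(values, n_bins=10):
    """Return (bin_edges, counts) for equal-width bins spanning the data.

    Assumes the values are not all identical (nonzero bin width).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in values:
        # Clamp the index so the maximum value falls in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return edges, counts

def normal_fit_counts(values, edges):
    """Expected count per bin under a normal fit (sample mean and std)."""
    fit = NormalDist(mean(values), stdev(values))
    n = len(values)
    return [n * (fit.cdf(b) - fit.cdf(a)) for a, b in zip(edges, edges[1:])]

random.seed(0)
# Hypothetical scaled protein 1 levels for 500 ensemble members at one time point.
levels = [random.gauss(0.7, 0.1) for _ in range(500)]
edges, counts = histogram(levels)          # 10 expression bins, as in the caption
expected = normal_fit_counts(levels, edges)
for a, b, c, e in zip(edges, edges[1:], counts, expected):
    print(f"[{a:.2f}, {b:.2f}): observed {c:3d}, normal fit {e:6.1f}")
```

Comparing the observed and fitted counts per bin gives a rough numerical analogue of overlaying the solid line on the bars.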


    fragility over the model ensemble (data not shown). Ranks estimated using unscaled sensitivity coefficients gave the best correlation with the true parameter values. The Kendall correlation between the true node rank and that estimated from the ensemble was 0.57 ± 0.15, while the mean edge rank correlation was 0.72 ± 0.09. The mean Spearman rank correlation for node rank was 0.73 ± 0.16, while the mean correlation for edge rank was 0.87 ± 0.08. Additionally, if we computed the correlation between the true rank and the mean node/edge rank (mean rank calculated over the ensemble before the rank correlation test), the Spearman correlation for nodes and edges increased to 0.91 and 0.97, respectively. Both correlation metrics and visual inspection (Fig. 8, control versus POET) suggested that edge rank was recovered better than node rank. In addition to the rank correlation, we calculated the fraction of the ensemble in which an edge or node was ranked the same as in the true parameter set (Fig. 8, bottom). Interestingly, both highly fragile and highly robust network features were recovered for edges (Fig. 8, bottom left) and nodes (Fig. 8, bottom right). For example, the highest and lowest ranked edges were recovered in more than 95% of the ensemble. However, minor network features were not similarly recovered (worst case recovery of only 20%). This suggested that we could expect to recover at least highly fragile or robust network features when using parametrically uncertain ensembles.
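The three comparisons in this paragraph (per-member rank correlation, correlation against the mean rank, and the fraction of the ensemble that recovers each rank exactly) can be sketched in a few lines of Python. The rankings below are made-up toy data, the function names are hypothetical, and the formulas assume tie-free rankings; how ties were handled in the study is not stated here.

```python
from statistics import mean

def spearman(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same items."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def kendall(rank_a, rank_b):
    """Kendall tau for two tie-free rankings of the same items."""
    n = len(rank_a)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
            score += 1 if s > 0 else -1  # concordant vs discordant pair
    return score / (n * (n - 1) / 2)

def to_ranks(values):
    """Convert scores to ranks 1..n (smallest score gets rank 1)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, i in enumerate(order):
        ranks[i] = position + 1
    return ranks

def exact_recovery(true_rank, ensemble_ranks):
    """Fraction of ensemble members that assign each item its true rank."""
    m = len(ensemble_ranks)
    return [sum(r[i] == t for r in ensemble_ranks) / m
            for i, t in enumerate(true_rank)]

# Hypothetical ranks for five edges: the true model and three ensemble members.
true_rank = [1, 2, 3, 4, 5]
ensemble_ranks = [[1, 2, 3, 4, 5], [1, 3, 2, 4, 5], [1, 2, 4, 3, 5]]

per_member_spearman = [spearman(true_rank, r) for r in ensemble_ranks]
per_member_kendall = [kendall(true_rank, r) for r in ensemble_ranks]
mean_rank = to_ranks([mean(col) for col in zip(*ensemble_ranks)])
mean_rank_spearman = spearman(true_rank, mean_rank)
recovery = exact_recovery(true_rank, ensemble_ranks)
```

In this toy case the extreme ranks are recovered by every member while the middle ranks are swapped in some members, and averaging the ranks before the correlation test removes those swaps. That is the same qualitative pattern the paragraph reports.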

    4 Discussion

    Mathematical modeling of complex gene expression programs is an emerging tool for understanding disease mechanisms. However, identification of large models with many unknown parameters requires that we use diverse training data. Training data taken from many sources can contain conflicts, for example different time scales, or can sometimes even be contradictory. Parameter estimation techniques that balance these conflicts might lead to robust model performance. POET has previously been used to identify molecular models of pain signaling [9]. We modified the original algorithm by incorporating a local parameter refinement step which generated candidate parameter sets with better error properties. Using the modified POET algorithm, we identified an ensemble of

    [Figure 7: panels for the control ensemble (perfect information, left) and the POET ensemble (uncertain parameters, right) at = 1, = 10 and = 20; each panel is labeled from fragile to robust.]

    Figure 7. Comparison of the species (node) fragility estimated from the ensemble versus the true parameter set for different values (= 0.1). The fraction of the top-modes in which a species was present was calculated for the true model (left) and the model ensemble (right).

    Table 1. Summary of the rank correlation for node and edge ranking between the ensemble and true parameter set

    Method          Node           Edge
    Scaled
      Kendall       0.51 ± 0.18    0.36 ± 0.11
      Spearman      0.65 ± 0.22    0.51 ± 0.15
    Unscaled
      Kendall       0.57 ± 0.15    0.72 ± 0.09
      Spearman      0.73 ± 0.16    0.87 ± 0.08


    parameter sets from synthetic data generated using the true parameters. We assumed that immunoblot training data (Western or Northern blots) were available to estimate the model ensemble. We introduced a systematic procedure to incorporate these types of experimental measurements into model identification. We characterized the parameter ensemble generated by POET by exploring the behavioral diversity of models in the ensemble and by examining how the fragility of nodes or edges varied over the ensemble.

    [Figure 8: panels for reactions (left) and species (right). Rows: control ensemble (perfect information), POET ensemble (uncertain parameters; reaction or species index versus ensemble index), and percentage correct classification (20% to 100%) versus sorted reaction or species index. Axes run between fragile and robust.]

    Figure 8. Comparison of the reaction (edge) and species (node) rank estimated from the ensemble versus the true parameter set for =1. The ordinal rank of the magnitude of the left (right) singular vector corresponding to the largest singular value was computed for the true model (top) and the model ensemble (middle). The fraction of trials in which a species or reaction was ranked exactly correctly was used to calculate the correct classification percentage.

    The deterministic ensemble exhibited heterogeneous population-like behavior. In this study, we suggested that deterministic ensembles could be used to model heterogeneous populations in situations where stochastic computation was not feasible. There is a rich and growing literature exploring the role of stochastic fluctuations in biological processes such as gene expression [34]. Today, stochastic gene expression models are not computationally feasible except for small networks. However, as stochastic simulation algorithms continue to improve, for example with hybrid [35] or leaping strategies [36], fully stochastic simulations will become tractable. Currently, the simulation of moderate to large problems typically relies on the population-averaged descriptions provided by ODEs. Within an ODE framework, we showed population-like effects using model ensembles. Population heterogeneity using deterministic model families was also recently explored for bacterial growth in batch cultures [37]. Distributions were generated because the model parameters varied over the ensemble, i.e., extrinsic noise led to population heterogeneity. Parameters controlling physical interactions, such as dissociation rates, or the rate of assembly or degradation of macromolecular machinery, such as ribosomes, were widely distributed over the ensemble. However, population heterogeneity can also arise from intrinsic noise [38]. Thus, deterministic ensembles, which do not capture intrinsic thermal fluctuations, provide a coarse-grained or extrinsic-only ability to simulate population diversity. Taken together, these studies motivate a deeper question as to whether a unique parameter set exists in biology. These results suggest that not just variation in the copy number of infrastructure like ribosomes or RNAP but also distributions in the strength of biophysical interactions could drive population heterogeneity. More studies are required to explore these questions and to test the notion that ensembles can model population heterogeneity. One concrete next step could be to try to recapitulate experimentally measured distributions, for example, flow cytometry measurements of protein markers. Longer term, coarse-grained deterministic ensembles might be a strategy to explore drug effects across cell populations [1].
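As a rough illustration of how a deterministic ensemble can produce population-like spread purely from parameter variation (extrinsic-only variability), consider the toy sketch below. It is not the three-gene network of this study: the one-state model dx/dt = k_prod - k_deg x, the nominal rate constants, the log-normal sampling and all function names are assumptions chosen only to keep the example self-contained.

```python
import math
import random
from statistics import mean, stdev

def protein_level(k_prod, k_deg, t):
    """Analytic solution of dx/dt = k_prod - k_deg * x with x(0) = 0."""
    return (k_prod / k_deg) * (1.0 - math.exp(-k_deg * t))

def sample_ensemble(n_members, nominal=(10.0, 2.0), spread=0.3, seed=1):
    """Draw (k_prod, k_deg) pairs log-normally around hypothetical nominal values."""
    rng = random.Random(seed)
    return [tuple(p * math.exp(rng.gauss(0.0, spread)) for p in nominal)
            for _ in range(n_members)]

ensemble = sample_ensemble(200)
times = [0.0, 0.25, 0.5, 1.0]  # hours

# One deterministic trajectory per ensemble member; no stochastic simulation.
trajectories = [[protein_level(kp, kd, t) for t in times] for kp, kd in ensemble]

for j, t in enumerate(times):
    column = [traj[j] for traj in trajectories]
    print(f"t = {t:.2f} h: mean = {mean(column):.2f}, sd = {stdev(column):.2f}")
```

Every member starts from the same synchronized state, and the spread across members grows only because the rate constants differ. Intrinsic fluctuations are absent by construction.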

    Sensitivity-based metrics, calculated from uncertain models, are often used to estimate which components of networks are fragile or robust. Thus, a reasonable question is whether the classification of nodes (species) and edges (interactions) as fragile or robust in uncertain models is correct. We explored this question by comparing nodes or edges estimated to be fragile or robust in the true model with those of the model ensemble. We showed that both locally and globally important network features were conserved across the ensemble. The most important local feature of our canonical network was transcription factor activation. Transcription factor regulation is a well-known integration layer in gene-expression architectures. For example, Bhardwaj et al. [39] showed in a range of networks that midlevel regulators, such as transcription factors, have the highest collaborative propensity. Thus, transcription factor regulation is perhaps one of the bow-ties described by Csete and Doyle [40]. Sensitivity analysis suggested that global infrastructure such as RNAP, nuclear transport and translation initiation was also fragile. The fragility of transcription and translation infrastructure has also been reported by Stelling et al. [16], who explored the robustness properties of Drosophila clock architectures, in cell-cycle architectures [19], and in growth factor signaling in LNCaP sub-clones [33], to cite just a few examples. Interestingly, highly fragile or robust network features were conserved across the ensemble. This suggested, as Bailey hypothesized, that analysis of experimentally constrained model ensembles could generate a reasonable estimate of what was important in a network without detailed parametric knowledge [6]. However, sensitivity analysis does not evaluate network performance following structural or operational perturbations [41]. Thus, an open question (yet to be explored) is whether an ensemble of models captures the fault tolerance or disturbance rejection properties of molecular networks.
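The fragility rankings discussed here rest on ordering species and interactions by the magnitude of their components in the dominant singular vectors of a sensitivity array (see the Figure 8 caption). A minimal Python sketch of that ranking step follows. The 3 x 2 array is invented, the function names are hypothetical, and power iteration is used only as a dependency-free substitute for a library SVD; it assumes a nonzero matrix whose largest singular value is well separated from the next.

```python
import math

def matvec(matrix, vec):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, vec)) for row in matrix]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dominant_singular_vectors(matrix, iterations=100):
    """Left and right unit singular vectors for the largest singular value.

    Power iteration on A^T A gives the right vector v; the left vector
    is the normalized product A v.
    """
    at = transpose(matrix)
    v = normalize([1.0] * len(matrix[0]))
    for _ in range(iterations):
        v = normalize(matvec(at, matvec(matrix, v)))
    u = normalize(matvec(matrix, v))
    return u, v

def rank_by_magnitude(vec):
    """Indices ordered from largest to smallest absolute component."""
    return sorted(range(len(vec)), key=lambda i: -abs(vec[i]))

# Hypothetical sensitivity array: rows are species (nodes), columns are
# parameters (edges). Values are illustrative only.
N = [[4.0, 1.0],
     [1.0, 0.5],
     [0.1, 0.1]]

u, v = dominant_singular_vectors(N)
node_order = rank_by_magnitude(u)  # most fragile species first
edge_order = rank_by_magnitude(v)  # most fragile interaction first
print("node order:", node_order, "edge order:", edge_order)
```

Repeating this for each ensemble member and comparing the resulting orders with those of the true parameter set gives the kind of comparison summarized in Table 1 and Figure 8.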

    The project described was supported by Award Number #U54CA143876 from the National Cancer Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Cancer Institute or the National Institutes of Health. We also acknowledge the generous support of the Office of Naval Research #N000140610293 to J.V. for the support of S.S.

    The authors have declared no conflict of interest.

    5 References

    [1] Kitano, H., A robustness based approach to systems-oriented drug design. Nat. Rev. Drug Discov. 2007, 6, 202–210.

    [2] Hornberg, J. J., Binder, B., Bruggeman, F. J., Schoeberl, B. et al., Control of mapk signalling: from complexity to what really matters. Oncogene 2005, 24, 5533–5542.

    [3] Gadkar, K. G., Varner, J., Doyle, F. J., Model identification of signal transduction networks from data using a state regulator problem. Syst. Biol. (Stevenage) 2005, 2, 17–30.

    [4] Gennemark, P., Wedelin, D., Benchmarks for identification of ordinary differential equations from time series data. Bioinformatics 2009, 25, 780–786.

    [5] Bandara, S., Schlöder, J., Eils, R., Bock, H. G., Meyer, T., Optimal experimental design for parameter estimation of a cell signaling model. PLoS Comput. Biol. 2009, 5, e1000558.

    [6] Bailey, J. E., Complex biology with no parameters. Nat. Biotechnol. 2001, 19, 503–504.

    [7] Covert, M., Knight, E., Reed, J., Herrgard, M., Palsson, B., Integrating high-throughput and computational data elucidates bacterial networks. Nature 2004, 429, 92–96.

    [8] Shen-Orr, S. S., Milo, R., Mangan, S., Alon, U., Network motifs in the transcriptional regulation network of Escherichia coli. Nat. Genet. 2002, 31, 64–68.

    [9] Song, S. O., Varner, J., Modeling and analysis of the molecular basis of pain in sensory neurons. PLoS One 2009, 4, e6758.

    [10] Battogtokh, D., Asch, D. K., Case, M. E., Arnold, J., Schuttler, H. B., An ensemble method for identifying regulatory circuits with special reference to the qa gene cluster of Neurospora crassa. Proc. Natl. Acad. Sci. USA 2002, 99, 16904–16909.

    [11] Kuepfer, L., Peter, M., Sauer, U., Stelling, J., Ensemble modeling for analysis of cell signaling dynamics. Nat. Biotechnol. 2007, 25, 1001–1006.

    [12] Brown, K. S., Sethna, J. P., Statistical mechanical approaches to models with many poorly known parameters. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 2003, 68, 021904.


    [13] Palmer, T., Shutts, G., Hagedorn, R., Doblas-Reyes, F. et al., Representing model uncertainty in weather and climate prediction. Annu. Rev. Earth Planetary Sci. 2005, 33, 163–193.

    [14] Gutenkunst, R. N., Waterfall, J. J., Casey, F. P., Brown, K. S. et al., Universally sloppy parameter sensitivities in systems biology models. PLoS Comput. Biol. 2007, 3, 1871–1878.

    [15] Moles, C. G., Mendes, P., Banga, J. R., Parameter estimation in biochemical pathways: a comparison of global optimization methods. Genome Res. 2003, 13, 2467–2474.

    [16] Stelling, J., Gilles, E. D., Doyle, F. J., Robustness properties of circadian clock architectures. Proc. Natl. Acad. Sci. USA 2004, 101, 13210–13215.

    [17] Luan, D., Zai, M., Varner, J. D., Computationally derived points of fragility of a human cascade are consistent with current therapeutic strategies. PLoS Comput. Biol. 2007, 3, e142.

    [18] Chen, W. W., Schoeberl, B., Jasper, P. J., Niepel, M. et al., Input-output behavior of erbb signaling pathways as revealed by a mass action model trained against dynamic data. Mol. Syst. Biol. 2009, 5, 239.

    [19] Nayak, S., Salim, S., Luan, D., Zai, M., Varner, J. D., A test of highly optimized tolerance reveals fragile cell-cycle mechanisms are molecular targets in clinical cancer trials. PLoS One 2008, 3, e2016.

    [20] Kholodenko, B. N., Kiyatkin, A., Bruggeman, F. J., Sontag, E. et al., Untangling the wires: a strategy to trace functional interactions in signaling and gene networks. Proc. Natl. Acad. Sci. USA 2002, 99, 12841–12846.

    [21] Kremling, A., Fischer, S., Gadkar, K. G., Doyle, F. J. et al., A benchmark for methods in reverse engineering and model discrimination: Problem formulation and solutions. Genome Res. 2004, 14, 1773–1785.

    [22] Gutenkunst, R. N., Waterfall, J. J., Casey, F. P., Brown, K. S. et al., Universally sloppy parameter sensitivities in systems biology. PLoS Comput. Biol. 2007, 3, e198.

    [23] Casey, F. P., Baird, D., Feng, Q., Gutenkunst, R. N. et al., Optimal experimental design in an EGFR signaling and down-regulation model. IET Syst. Biol. 2007, 1, 190–202.

    [24] Dickinson, R. P., Gelinas, R. J., Sensitivity analysis of ordinary differential equation systems – A direct method. J. Comp. Phys. 1976, 21, 123–143.

    [25] Fonseca, C., Fleming, P. J., Genetic algorithms for multiobjective optimization: Formulation, discussion and generalization, in: Proceedings of the 5th International Conference on Genetic Algorithms, Morgan Kaufmann, San Mateo 1993, pp. 416–423.

    [26] Gadkar, K. G., Doyle, F. J. 3rd, Crowley, T. J., Varner, J. D., Cybernetic model predictive control of a continuous bioreactor with cell recycle. Biotechnol. Prog. 2003, 19, 1487–1497.

    [27] Varner, J. D., Large-scale prediction of phenotype: Concept. Biotechnol. Bioeng. 2000, 69, 664–678.

    [28] Fussenegger, M., Bailey, J., Varner, J., A mathematical model of caspase function in apoptosis. Nat. Biotechnol. 2000, 18, 768–774.

    [29] Schoeberl, B., Eichler-Jonsson, C., Gilles, E. D., Müller, G., Computational modeling of the dynamics of the map kinase cascade activated by surface and internalized egf receptors. Nat. Biotechnol. 2002, 20, 370–375.

    [30] Li, H., Ung, C. Y., Ma, X. H., Liu, X. H. et al., Pathway sensitivity analysis for detecting pro-proliferation activities of oncogenes and tumor suppressors of epidermal growth factor receptor-extracellular signal-regulated protein kinase pathway at altered protein levels. Cancer 2009, 115, 4246–4263.

    [31] Stites, E. C., Trampont, P. C., Ma, Z., Ravichandran, K. S., Network analysis of oncogenic ras activation in cancer. Science 2007, 318, 463–467.

    [32] Helmy, M., Gohda, J., Inoue, J. I., Tomita, M. et al., Predicting novel features of toll-like receptor 3 signaling in macrophages. PLoS One 2009, 4, e4661.

    [33] Tasseff, R., Nayak, S., Salim, S., Kaushik, P. et al., Analysis of the molecular networks in androgen dependent and independent prostate cancer revealed fragile and robust subsystems. PLoS One 2010, 5, e8864.

    [34] Elowitz, M. B., Levine, A. J., Siggia, E. D., Swain, P. S., Stochastic gene expression in a single cell. Science 2002, 297, 1183–1186.

    [35] Iyengar, K. A., Harris, L. A., Clancy, P., Accurate implementation of leaping in space: The spatial partitioned-leaping algorithm. J. Chem. Phys. 2010, 132, 094101.

    [36] Cao, Y., Petzold, L. R., Rathinam, M., Gillespie, D. T., The numerical stability of leaping methods for stochastic simulation of chemically reacting systems. J. Chem. Phys. 2004, 121, 12169–12178.

    [37] Lee, M. W., Vassiliadis, V. S., Park, J. M., Individual-based and stochastic modeling of cell population dynamics considering substrate dependency. Biotechnol. Bioeng. 2009, 103, 891–899.

    [38] Swain, P. S., Elowitz, M. B., Siggia, E. D., Intrinsic and extrinsic contributions to stochasticity in gene expression. Proc. Natl. Acad. Sci. USA 2002, 99, 12795–12800.

    [39] Bhardwaj, N., Yan, K. K., Gerstein, M. B., Analysis of diverse regulatory networks in a hierarchical context shows consistent tendencies for collaboration in the middle levels. Proc. Natl. Acad. Sci. USA 2010, 107, 6841–6846.

    [40] Csete, M., Doyle, J., Bow ties, metabolism and disease. Trends Biotechnol. 2004, 22, 446–450.

    [41] Shoemaker, J. E., Doyle, F. J., Identifying fragilities in biochemical networks: Robust performance analysis of fas signaling-induced apoptosis. Biophys. J. 2008, 95, 2610–2623.