Probabilistic Interaction Networks

Parameter fitting. Structural inference. Priors and regularization

Original Article

BIOINFORMATICS ORIGINAL PAPER Vol. 25 no. 17 2009, pages 2229–2235, doi:10.1093/bioinformatics/btp375

Systems biology

Reconstructing signaling pathways from RNAi data using probabilistic Boolean threshold networks

Lars Kaderali1,∗, Eva Dazert2, Ulf Zeuge2, Michael Frese2,† and Ralf Bartenschlager2

1Viroquant Research Group Modeling, University of Heidelberg, Bioquant BQ26, Im Neuenheimer Feld 267 and 2Medical Faculty, Department of Molecular Virology, University of Heidelberg, Im Neuenheimer Feld 345, 69120 Heidelberg, Germany

Received on January 29, 2009; revised on May 6, 2009; accepted on June 9, 2009

Advance Access publication June 19, 2009

Associate Editor: Trey Ideker

ABSTRACT

Motivation: The reconstruction of signaling pathways from gene knockdown data is a novel research field enabled by developments in RNAi screening technology. However, while RNA interference is a powerful technique to identify genes related to a phenotype of interest, their placement in the corresponding pathways remains a challenging problem. Difficulties are aggravated if not all pathway components can be observed after each knockdown, but readouts are only available for a small subset. We are then facing the problem of reconstructing a network from incomplete data.

Results: We infer pathway topologies from gene knockdown data using Bayesian networks with probabilistic Boolean threshold functions. To deal with the problem of underdetermined network parameters, we employ a Bayesian learning approach, in which we can integrate arbitrary prior information on the network under consideration. Missing observations are integrated out. We compute the exact likelihood function for smaller networks, and use an approximation to evaluate the likelihood for larger networks. The posterior distribution is evaluated using mode hopping Markov chain Monte Carlo. Distributions over topologies and parameters can then be used to design additional experiments. We evaluate our approach on a small artificial dataset, and present inference results on RNAi data from the Jak/Stat pathway in a human hepatoma cell line.

Availability: Software is available on request.

Contact: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

The qualitative and quantitative characterization of interactions between genes in signal transduction and genetic regulatory networks is a major research goal in systems biology. RNA interference (RNAi) offers an approach to systematically screen for genes associated with a particular phenotype or cellular pathway of interest (Fire et al., 1998). As one example, several genome-wide screens have been published recently with the aim to identify host factors required for viral replication (Haasnoot et al., 2007).

∗To whom correspondence should be addressed.
†Present address: Faculty of Applied Science, University of Canberra, Canberra, ACT, Australia.

Readouts for large-scale RNAi screens are typically based on single reporters such as viability (Boutros et al., 2004) or staining of particular proteins (Brass et al., 2008). For RNAi assays or other perturbation experiments involving a lower number of knockdowns, high-dimensional readouts, e.g. using microarrays (Boutros et al., 2002), are feasible. High-content, high-throughput image-based screens are rapidly developing, and offer opportunities for high-dimensional readouts at a genome-wide scale (Sacher et al., 2008).

While RNAi is well suited to identify genes associated with a particular phenotype, the temporal and spatial placement of these genes in the respective cellular pathways remains a challenging problem (Moffat and Sabatini, 2006). Such placement is sometimes possible using extensive interrogation of databases and literature (König et al., 2008); however, more automated approaches applicable also when no annotation is available for some or all genes would clearly be desirable.

Most previous work on inferring networks from RNAi data focuses on clustering of phenotypes to generate phenoclusters of genes showing similar effects upon perturbation (Sacher et al., 2008). Such methods are based on an underlying distance measure between phenotypes, which typically weighs each feature of a phenotype in the same, fixed way in determining the distance. However, such an equal treatment of different measurements, possibly even made in different units, is often inappropriate, and favors features with high variability. In addition, such measures may miss important properties of the data which are not related to the similarity of genes within a cluster, but to the relationships of effects in different clusters (Markowetz et al., 2007).

To address these problems, Markowetz et al. (2005) have proposed the Nested Effects Model (NEM) framework. NEMs use the nested structure of observed perturbation effects to infer a genetic hierarchy of the perturbed genes. They have successfully been applied to several different datasets to date (Fröhlich et al., 2007; Markowetz et al., 2005, 2007). Anchang et al. (2009) have recently proposed an extension of NEMs to account for dynamic aspects of signal transduction.

A disadvantage of NEMs is that they require relatively high-dimensional phenotype readouts per knockdown. This precludes their use on screens with only one or a very limited number of different effect readouts per knockdown. In addition, NEMs are not based on an underlying model describing how genes in a pathway

© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

Brief summary

• Probabilistic interaction model

• Parameter estimation

• Regularization through priors

• Sanity tests

• Practical applications and results

Interaction Model

Interaction network

• A network of genes evolves over time

• Time and gene values are discrete

[Figure: a network of five genes x1, ..., x5 unrolled over the time steps t = 0, 1, 2]

• Time-lag is one

Update rules

... ↦ x(t − 1) ↦ x(t) ↦ x(t + 1) ↦ ...

• Updates are probabilistic

[Figure: the previous states x1(t − 1), ..., x5(t − 1) jointly determine xi(t)]

$$\Pr[x_i(t) = 1] = \frac{1}{1 + \exp(-\gamma \cdot c)}, \qquad c = w_{0i} + w_{1i}\, x_1(t-1) + \cdots + w_{5i}\, x_5(t-1)$$

• Parameters determine the network structure
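The update rule can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation; the function name `activation_probability` and all numeric values are hypothetical.

```python
import math

def activation_probability(prev_state, weights, bias, gamma=1.0):
    """Probability that gene i is active at time t given x(t-1).

    prev_state: 0/1 values x_1(t-1), ..., x_n(t-1)
    weights:    w_1i, ..., w_ni (illustrative values, not fitted)
    bias:       w_0i
    gamma:      scaling constant of the logistic function
    """
    c = bias + sum(w * x for w, x in zip(weights, prev_state))
    return 1.0 / (1.0 + math.exp(-gamma * c))

# Hypothetical gene that is activated by gene 1 and repressed by gene 2.
weights = [2.0, -2.0, 0.0, 0.0, 0.0]
p_on = activation_probability([1, 0, 0, 0, 0], weights, bias=-1.0, gamma=3.0)
p_off = activation_probability([1, 1, 0, 0, 0], weights, bias=-1.0, gamma=3.0)
```

Larger values of `gamma` push both probabilities towards 0 or 1, so the rule approaches a deterministic Boolean threshold function.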

What really happens

[Figure: the logistic activation curve; x-axis from −2 to 2, y-axis from 0 to 1]

• There are three important regions

• γ determines the width of the linear region

Meaning of weights

• Weights encode network structure

• The sign of a weight determines whether a gene acts as a repressor or an activator

• Ratios between weights determine their impact on the outcome

• Model ignores higher order effects

Likelihood function

Given any observation sequence

... ↦ x(t − 1) ↦ x(t) ↦ x(t + 1) ↦ ...

we can compute the corresponding probability

Pr [x(0), . . . ,x(2)] = Pr [x(0)] · Pr [x(1)|x(0)] · Pr [x(2)|x(1)]

where the weights and the scaling constant γ determine the transition probabilities Pr [x(t)|x(t − 1)]
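The factorization can be evaluated in log space. The sketch below assumes that the genes are updated independently of each other given the previous state, which is implied by the per-gene update rule but not stated on the slide. The layout `W[j][i]` (weight from gene j to gene i) and all function names are hypothetical.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def transition_log_prob(prev, nxt, W, bias, gamma):
    """log Pr[x(t) | x(t-1)] as a sum over genes (assumed independent)."""
    n = len(prev)
    total = 0.0
    for i in range(n):
        z = gamma * (bias[i] + sum(W[j][i] * prev[j] for j in range(n)))
        # log Pr[x_i = 1] = log_sigmoid(z), log Pr[x_i = 0] = log_sigmoid(-z)
        total += log_sigmoid(z) if nxt[i] == 1 else log_sigmoid(-z)
    return total

def sequence_log_likelihood(states, W, bias, gamma, log_p_initial=0.0):
    """log Pr[x(0), ..., x(T)] = log Pr[x(0)] + sum of transition terms."""
    total = log_p_initial
    for prev, nxt in zip(states, states[1:]):
        total += transition_log_prob(prev, nxt, W, bias, gamma)
    return total

# Two genes with all weights zero: every transition has probability 1/4.
zero_W = [[0.0, 0.0], [0.0, 0.0]]
ll = sequence_log_likelihood([(0, 0), (1, 0), (1, 1)], zero_W, [0.0, 0.0], 1.0)
```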

Parameter estimation

Experiment setup

• Basic network topology is altered by introducing knockdowns

• These changes induce controlled variation, which gives more information about the weights

• We assume that the experiments are independent, which yields a single likelihood formula

For a knocked-down gene, x(t) ≡ 0 at all time points.

Pr [Data] = Pr [KD-exp_1] · · · Pr [KD-exp_ℓ]
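Under the independence assumption the product becomes a sum of per-experiment log-likelihoods. The sketch below makes a further assumption that is not on the slide: a knocked-down gene is clamped to 0 by design, so it contributes no factor to the likelihood. The tuple layout `(prev, nxt, knocked)` and the function names are hypothetical.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def kd_transition_log_prob(prev, nxt, W, bias, gamma, knocked):
    """log Pr[x(t) | x(t-1)] with knocked-down genes clamped to 0."""
    n = len(prev)
    total = 0.0
    for i in range(n):
        if i in knocked:
            continue  # clamped by the experimenter, not a random outcome
        z = gamma * (bias[i] + sum(W[j][i] * prev[j] for j in range(n)))
        total += log_sigmoid(z) if nxt[i] == 1 else log_sigmoid(-z)
    return total

def data_log_likelihood(experiments, W, bias, gamma):
    """log Pr[Data] as the sum over independent knockdown experiments."""
    return sum(
        kd_transition_log_prob(prev, nxt, W, bias, gamma, knocked)
        for prev, nxt, knocked in experiments
    )

# Two toy experiments on a 2-gene network with all weights zero.
zero_W = [[0.0, 0.0], [0.0, 0.0]]
experiments = [
    ((1, 1), (1, 0), set()),   # no knockdown: two free genes
    ((0, 1), (0, 1), {0}),     # gene 0 knocked down: one free gene
]
total_ll = data_log_likelihood(experiments, zero_W, [0.0, 0.0], 1.0)
```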

Maximum likelihood method will fail

• For 5 genes, there are 6 × 5 = 30 parameters (five weights and one bias per gene)

• For 5 genes, there are 2^5 × 2^5 = 1024 possible transition patterns

• If we are extremely lucky, biologists collect at most 100 comparable samples

• Discretization destroys a lot of information

• Not enough information in the data
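The counts on this slide follow from simple formulas, sketched here for an arbitrary number of genes. The function names are hypothetical.

```python
def count_parameters(n_genes):
    """One bias plus n incoming weights for each of the n genes."""
    return n_genes * (n_genes + 1)

def count_transition_patterns(n_genes):
    """Pairs (x(t-1), x(t)) of binary state vectors."""
    return (2 ** n_genes) ** 2

params_5 = count_parameters(5)
patterns_5 = count_transition_patterns(5)
```

With at most 100 samples, only a small fraction of the possible transition patterns can ever be observed, which is why the data alone cannot pin down the parameters.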

Regularization with prior distribution

• With priors, one can essentially state what outcome is expected

• One can remove irrelevant weights that cannot be in the network

• One can force sparsity and thus obtain more stable and interpretable results

Cookbook for priors

• There are three basic parameter groups that are independent

p [t, (w0i), (wij)] = p [t] · p [(w0i)] · p [(wij)]

• The first prior is used to penalize long experiments, which convey relatively little about which transitions are possible

• Relevant only if time points are taken at different intervals.

[Figure: Gamma prior density for the time parameter, plotted over 0 to 22 with a peak near 0.15. Annotation: a long tail is good]

Time parameter

Gamma distribution is not the most reasonable choice for the prior for the time parameter

α = 8, β = 1
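The density with these parameters can be evaluated directly. The sketch assumes that α is the shape and β the rate; with β = 1 the rate and scale conventions coincide, so the choice does not matter here. The function name `gamma_pdf` is hypothetical.

```python
import math

def gamma_pdf(t, alpha, beta):
    """Gamma density with shape alpha and rate beta (convention assumed)."""
    if t <= 0:
        return 0.0
    log_p = (alpha * math.log(beta) + (alpha - 1) * math.log(t)
             - beta * t - math.lgamma(alpha))
    return math.exp(log_p)

alpha, beta = 8, 1
mode = (alpha - 1) / beta          # location of the peak for alpha > 1
peak = gamma_pdf(mode, alpha, beta)
tail = gamma_pdf(20.0, alpha, beta)
```

The peak sits at t = 7 with a density of about 0.149, while the density at t = 20 is small but not zero, which is the long tail referred to above.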

Bias parameters

[Figure: prior density for the bias parameters, plotted over −110 to 0]

The Gamma prior forces the bias to be negative, so genes tend to be inactive by default.

Most transitions must be improbable. A double-sided exponential prior is the tool to ensure that.

Node connections

[Figure: prior density for the weight of a connection x → y, plotted over −12 to 10]
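A sparsity-inducing prior on connection weights can be sketched as follows. This assumes that the double-sided exponential mentioned earlier is the prior placed on the connection weights; the slides do not give its parameters, so the `scale` value and the function names are hypothetical.

```python
import math

def double_exp_log_pdf(w, scale):
    """Log-density of a zero-centred double-sided exponential."""
    return -abs(w) / scale - math.log(2.0 * scale)

def log_prior_weights(W, scale):
    """Sum of independent log-priors over all connection weights."""
    return sum(double_exp_log_pdf(w, scale) for row in W for w in row)

# A sparse weight matrix receives a higher prior than a dense one.
sparse_W = [[0.0, 5.0], [0.0, 0.0]]
dense_W = [[5.0, 5.0], [5.0, 5.0]]
lp_sparse = log_prior_weights(sparse_W, scale=7.0)
lp_dense = log_prior_weights(dense_W, scale=7.0)
```

Adding this term to the log-likelihood penalizes every non-zero weight in proportion to its magnitude, which is how the prior favors networks with few edges.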

Missing observations

• If something is missing then we have to integrate it out

• This is consistent only if the data is assumed to be missing at random

• If values are missing depending on expression level, the approach will fail
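For binary states, integrating out a missing value is a sum over both possible values. The sketch below enumerates every completion of the missing entries, which is only feasible when few values are missing. Missing entries are marked with `None`, and `init_prob` and `trans_prob` are hypothetical callables that stand in for the model.

```python
import itertools

def marginal_likelihood(states, init_prob, trans_prob):
    """Sum the sequence probability over all completions of missing values.

    states:     list of tuples with entries 0, 1 or None (missing)
    init_prob:  function state -> Pr[x(0)]
    trans_prob: function (prev, nxt) -> Pr[x(t) | x(t-1)]
    """
    missing = [(t, i) for t, s in enumerate(states)
               for i, v in enumerate(s) if v is None]
    total = 0.0
    for values in itertools.product((0, 1), repeat=len(missing)):
        filled = [list(s) for s in states]
        for (t, i), v in zip(missing, values):
            filled[t][i] = v
        p = init_prob(tuple(filled[0]))
        for prev, nxt in zip(filled, filled[1:]):
            p *= trans_prob(tuple(prev), tuple(nxt))
        total += p
    return total

# Toy model on 2 genes in which every state and transition is equally likely.
uniform_init = lambda state: 0.25
uniform_trans = lambda prev, nxt: 0.25
ml = marginal_likelihood([(0, 0), (None, 1), (1, 1)], uniform_init, uniform_trans)
```

The number of completions doubles with every missing value, which is one reason the exact computation does not scale to larger networks.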

Theory is done

• However, the posterior is not normalized and we have to integrate.

• If some values are missing, then we also have to integrate them out.

• This cannot be done analytically

• MCMC comes to the rescue

Epic fail in practice

• Transition probabilities cannot be computed for networks of a reasonable size

• MCMC in a pure form is not applicable

• Let us simulate the data to compute the approximate likelihood of the data

• MCMC does not converge to the posterior

What can we do!?

• Find some local maxima, occasionally reset MCMC and restart from such points

• Implement the whole thing in C++ and hope that it is fast enough to converge
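The restart heuristic from this slide can be sketched on a one-dimensional toy problem. This is plain Metropolis sampling with occasional resets to precomputed modes. It is only an illustration of the slide's idea: the resets do not preserve detailed balance, and this is not the mode hopping sampler used in the paper. The target density, the mode locations and all names are hypothetical.

```python
import math
import random

def log_target(x):
    """Toy bimodal log-density with modes near -4 and +4 (illustrative)."""
    a = -0.5 * (x + 4.0) ** 2
    b = -0.5 * (x - 4.0) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def sample_with_restarts(n_steps, modes, restart_prob=0.05, step=0.5, seed=0):
    """Random-walk Metropolis with occasional resets to known modes."""
    rng = random.Random(seed)
    x = modes[0]
    samples = []
    for _ in range(n_steps):
        if rng.random() < restart_prob:
            x = rng.choice(modes)  # heuristic reset, not a valid MCMC move
        else:
            proposal = x + rng.gauss(0.0, step)
            log_ratio = log_target(proposal) - log_target(x)
            if rng.random() < math.exp(min(0.0, log_ratio)):
                x = proposal
        samples.append(x)
    return samples

samples = sample_with_restarts(2000, modes=[-4.0, 4.0])
```

Without the resets, a chain with this step size would rarely cross the low-density region between the two modes and would report only one of them.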

Sanity Tests

Test target

• Gene 4 is activated if gene 2 OR gene 3 is activated

• All other genes are activated if their parents are activated

• Individual knockouts were done for genes 2, 3 and 4

• A combined knockout was done for the gene pair 2 and 3
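The test design can be illustrated with a deterministic Boolean version of the network. The topology used here (1→2, 1→3, 2→4, 3→4, 4→5, with gene 1 as the stimulated input) is an assumption, since the slide shows the network only as a figure. The function name `simulate` is hypothetical.

```python
def simulate(knockouts=frozenset(), stimulus=1):
    """Steady state of the assumed toy network under the given knockouts."""
    def on(gene, value):
        # A knocked-out gene stays at 0 regardless of its parents.
        return 0 if gene in knockouts else int(bool(value))

    x = {}
    x[1] = on(1, stimulus)
    x[2] = on(2, x[1])
    x[3] = on(3, x[1])
    x[4] = on(4, x[2] or x[3])   # OR gate: either parent suffices
    x[5] = on(5, x[4])
    return x

single = simulate(knockouts={2})
double = simulate(knockouts={2, 3})
```

In this sketch a single knockout of gene 2 or gene 3 leaves gene 5 active, because the other parent still activates gene 4. Only the combined knockout silences gene 4 and gene 5, which is why the double knockout is needed to reveal the OR relationship.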

[Figure: test network with nodes 1 to 5]

Test results

[Figure: the two reconstructed topologies on nodes 1 to 5, with posterior probabilities 0.39 and 0.61]

• The algorithm seems relatively insensitive to the parameters of the prior.

• The prior parameters must be within about 50% of the optimal choice.

• No experiments with added noise were done.

Practical results

Gene network

• Network of 10 genes and 2 auxiliary inputs

restricted the weights Mi,j to the positive quadrant, thus allowing only activatory interactions in the network. Running time of the algorithm including mode finding, optimization and sampling was ∼126 min, with an acceptance rate of ∼27% in sampling.

Assignment of the points sampled to the modes of the posterior using Euclidean distance reveals two alternative topologies, which receive high posterior probability mass: the ‘correct’ topology shown in Figure 2 (P = 0.39), and the alternative network with connections 1→4, 4→2, 4→3, 2→5 and 3→5 (P = 0.61).
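The assignment step can be sketched as a nearest-mode classification. Each sampled parameter vector is assigned to the closest mode in Euclidean distance, and the fraction of samples per mode estimates the posterior mass of the corresponding topology. The sample and mode values below are made up for illustration, and the function name is hypothetical.

```python
import math
from collections import Counter

def mode_probabilities(samples, modes):
    """Fraction of samples whose nearest mode (Euclidean) is each mode."""
    counts = Counter()
    for s in samples:
        nearest = min(range(len(modes)), key=lambda k: math.dist(s, modes[k]))
        counts[nearest] += 1
    return [counts[k] / len(samples) for k in range(len(modes))]

# Made-up two-dimensional parameter samples around two modes.
modes = [(0.0, 0.0), (10.0, 10.0)]
samples = [(0.1, 0.0), (9.0, 10.0), (11.0, 9.0), (0.0, 1.0), (10.0, 10.0)]
probs = mode_probabilities(samples, modes)
```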

In addition to the topologies, sampling yields distributions over individual edges. For example, the connection from gene 4 to gene 2 has a mean weight of 10.8, with SD 18.4. This edge is present in only one of the two inferred topologies. Correspondingly, a histogram of the sampled weights for this edge shows a bimodal distribution (data not shown). On the other hand, the edge from node 2 to node 3 has mean weight 0.39 with SD 6.0, reflecting that the edge is present in neither topology.

We repeated the computation five times, with different random seeds, and obtained very similar results in all repetitions. To further assess the robustness of sampling with respect to parameters of the prior distributions, we carried out a systematic perturbation study. In brief, each parameter was varied individually by ±10% and ±50%. In addition, we jointly varied all parameters simultaneously by the same amounts. Network reconstruction was repeated for each setting, and resulting networks assessed. While magnitudes of reconstructed weights do scale with parameters of the respective prior, overall results are very similar to the ones reported above, and the two alternative topologies are correctly reconstructed in all cases where parameters were varied by 10%. The method still robustly identified the two topologies for parameter variations of ±50%, except for the increase of a0 by 50% (learned network with only direct connection 1→5) and the decrease of γ by 50% (only one of the network topologies found). Interestingly, results were again robust to joint variation of all parameters by ±10% and ±50%.

We next repeated the computation using Algorithm B, again using five repetitions, with parameters as above. Runtime of this approach was ∼38 min and hence roughly 70% faster than using Algorithm A. Results were otherwise completely comparable with the results reported for Algorithm A.

3.4 JAK/STAT signal transduction

To evaluate our approach on real biological data, we looked for a reasonably well-characterized signal transduction pathway. This allows us to compare our inference results with literature knowledge, thus providing a means to evaluate the performance of the reconstruction algorithm.

The Janus Kinases (JAK) and Signal Transducers and Activators of Transcription (STAT) pathway is central to cellular response to cytokines and growth factors, and the core pathway is fairly well understood (Platanias, 2005). Interferon (IFN) signaling via the JAK/STAT pathway constitutes the first line of defense in viral infections, with potent antiviral and growth-inhibitory effects.

JAK/STAT signal transduction is stimulated by type I or II IFNs, which activate two different receptor complexes in the pathway. Type I IFN receptor is composed of two subunits, IFNAR1 and IFNAR2, which are associated with the kinases TYK2 and JAK1, respectively. Activation of these kinases results in phosphorylation

Fig. 3. Classical JAK/STAT pathway, compare Platanias (2005). Signaling is triggered by type I or II IFN, which bind to IFN-α receptor and IFN-γ receptor complexes, respectively. Signaling then proceeds by phosphorylation of STAT1 and/or STAT2, leading to dimer formation and ultimately transcription of genes relevant to immune response.

of STAT1 and STAT2, which leads to the formation of STAT1-STAT2-IRF9 heterotrimeric complexes, which translocate to the cell nucleus and initiate transcription of a wide range of proteins, including some antiviral genes.

Similarly, type II IFN receptor is composed of IFNGR1 and IFNGR2, which are in turn associated with JAK1 and JAK2, respectively. Their activation leads to phosphorylation of STAT1, which forms a homodimer and after translocation to the nucleus also leads to gene transcription, including certain antiviral genes.

In addition, IFN type I-induced phosphorylation of STAT1 can also lead to formation of STAT1 homodimers. Figure 3 summarizes the signaling cascades in the classical JAK/STAT pathway.

3.4.1 RNAi experiments: We carried out systematic knockdowns of the 10 genes, IFNAR1, IFNAR2, IFNGR1, IFNGR2, JAK1, JAK2, TYK2, STAT1, STAT2 and IRF9, involved in the JAK/STAT pathway (Fig. 3) under three different conditions: no stimulation, IFN-α stimulation and IFN-γ stimulation, in a human hepatoma cell line (Nakabayashi et al., 1982) stably transfected with a subgenomic, self-replicating hepatitis C virus (HCV) RNA (Lohmann et al., 1999; Vrolijk et al., 2003). As negative controls, additional siRNAs against a viral gene from an unrelated virus were used, as well as wells containing naive cells only, without siRNA. The phenotype readout in our assay consists of measurements of reporter protein production (luciferase), whose amount depends on replication of the HCV RNA. Measurements were done in two complete biological replicates each, with two siRNAs per gene and three replicate knockdowns per siRNA and biological replicate, yielding a total of six measurements per siRNA or 12 measurements per gene for each of the three conditions. Details on the experimental procedure are given in the Supplementary Material.

Practical experiment

• 10 single gene knockouts

• 3 initial conditions: 00, 10, 01

• 12 samples for each setup, each yielding a single transition x(0) → x(t)

• Fails in global network reconstruction

• Successful in reconstructing edges
