Page 1: TECHNICAL REPORT March 13, 2012 Department of Computer ...user.ceng.metu.edu.tr/~tcan/TR/TR_erenOzsoy_March2012.pdf · developed, such as color coding algorithm [11] and Netsearch

*Corresponding author. Email: [email protected]

TECHNICAL REPORT

March 13, 2012

Department of Computer Engineering

Middle East Technical University

Construction of signaling pathways from PPI and RNAi data using Linear Programming

Oyku Eren Ozsoy (a,*) and Tolga Can (b)

(a) Informatics Institute, Middle East Technical University, Ankara, TURKEY; (b) Department of Computer Engineering, Middle East Technical University, Ankara, TURKEY

For the reconstruction of signaling pathways from RNAi data, an integer linear optimization model is proposed. The aim is to reconstruct the signaling network from a given protein-protein interaction (PPI) network so that it satisfies the RNAi data while making the minimum number of changes to the given network. For evaluation, 1000 reference PPI networks, each with seven, eight, or nine proteins, and RNAi data for each of the regular proteins in the network were generated randomly. The solutions were examined to obtain a general overview of the reconstruction of signaling networks from RNAi data with the proposed method.

Keywords: Computational biology; bioinformatics; integer programming; graph theory and

networks

1. Introduction

Finding the relationship between the genes in signal transduction networks and gene

regulatory networks/protein-protein interaction (PPI) networks is a very important problem in

systems biology. Signal transduction is a series of biochemical reactions involving proteins,

therefore PPI data can be used as a source for reconstruction of the signaling pathways. While

signaling pathways are considered to be directed, high-throughput protein-protein interaction

data is undirected. It is possible to transform this undirected data to directed pathways [3].

For the reconstruction of signaling pathways from PPI data, several methods have been

developed, such as color coding algorithm [11] and Netsearch algorithm [12].

RNA interference (RNAi) is a technique used for finding the genes in a pathway [2].

For large-scale RNAi screens, the readouts are generally based on single reporters [1]. High-content, high-throughput image-based screens at a genome-wide scale are developing rapidly [10]. Identification of genes associated with a particular phenotype can be done by the RNAi technique, which investigates the downstream effects of a silenced gene. However, the placement of these genes in space and time in the respective cellular pathways remains a problem [9].

It is possible to place genes in the respective pathways by interrogation of databases and

literature [7]. In the cases when there is insufficient information about the genes in the

pathways, automated approaches that place these genes in the network have to be developed.

A survey of the methods developed for such purposes was carried out by Kaderali and Radde [5]. The developed methods use Boolean models, correlation-based models and associative network approaches, Bayesian networks, differential equation models and similar techniques.

Several methods use microarray gene expression as data and aim to generate gene regulation

networks using time dependent or static data. Some methods like Bayesian networks allow

the integration of biological prior information. Despite all these developed methods, the

temporal and spatial placement of the genes in a signaling pathway is still a challenging

problem.

For the construction of signal transduction networks using RNA interference, only a

few methods are available. Markowetz et al. [8] proposed Nested Effect Models for this

problem. Such models construct the signal transduction networks by using the nested

structure of observed perturbation effects. Although these models suit cause-effect data such as RNAi data, they require several kinds of readouts and a relatively high number of readouts per knockdown. This prevents their use with the results of RNAi experiments that use one messenger gene. Kaderali et al. [6] developed a probabilistic method that can reconstruct network topologies from single-gene knockdown data by generating topologies consistent with this data. However, because of the computational complexity of this method, only small networks can be solved in a reasonable time, and data other than RNAi data, such as time series and real-time screening data, cannot be used. They show that some information about the pathway topology can still be retrieved, and that the results can be used to design additional experiments to resolve the topology further.

In order to identify the network topologies from the given knockdown data, Kaderali

et al. [6] consider a deterministic model, where a node i is activated, if any node j with an

edge j-i is active. They simulate data with this model, and use complete enumeration of all

network topologies with a given number of nodes to determine how many topologies are

compatible with the data.
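The deterministic activation rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `sink_activated`, the edge-set representation, and the small example network are hypothetical and are not taken from Kaderali et al. [6].

```python
from collections import deque

def sink_activated(edges, source, sink, knocked_down=()):
    """Return True if the signal reaches `sink` from `source` when the nodes
    in `knocked_down` are silenced (all of their edges are disabled)."""
    blocked = set(knocked_down)
    if source in blocked or sink in blocked:
        return False
    active = {source}
    queue = deque([source])
    while queue:
        j = queue.popleft()
        for (a, i) in edges:
            # Node i becomes active if any active node j has an edge j -> i.
            if a == j and i not in blocked and i not in active:
                active.add(i)
                queue.append(i)
    return sink in active

# Hypothetical 5-node network: s -> 1 -> t and 3 -> t.
edges = {("s", "1"), ("1", "t"), ("3", "t")}
readout = {k: sink_activated(edges, "s", "t", knocked_down=[k])
           for k in ["1", "2", "3"]}
```

Each entry of `readout` is the single bit that one single-gene knockdown contributes for this network.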

The problem here is that, as the number of nodes increases, the number of compatible topologies increases exponentially. For n different components in the network, there are $n^2$ edges, each of which can be present or absent in a network, leading to $2^{n \times n}$ different possible

network topologies. On the other hand, single-gene knockdowns with a single phenotypic

readout per knockdown will yield only n bits of information for network reconstruction.

When distinguishing activating and deactivating influences, this problem grows even further.

Then additional information is needed to be able to reconstruct a unique topology, such as

further observable nodes, stimulation of other nodes, time series measurements or

combinatorial knockdowns. They counter this problem using a prior distribution on model

parameters that drives the network inference to sparse networks. In addition, they can make

use of multiple (e.g. double) knockdowns if available, or of multiple different stimulations.

Using Boolean threshold functions in a Bayesian network, they sample from the posterior

over model parameters. For larger networks, they use a likelihood approximation based on

stochastic simulation. The model allows inclusion of further observable nodes, stimulation of

other nodes, time series measurements or combinatorial knockdowns.
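The counting argument at the start of this paragraph can be made concrete with a short calculation. The sketch below simply evaluates the two quantities named in the text, the number of candidate topologies and the number of bits obtained from single-gene knockdowns; the function names are illustrative only.

```python
def topology_count(n):
    """Number of directed topologies when each of the n*n edges is present or absent."""
    return 2 ** (n * n)

def single_knockdown_bits(n):
    """One binary readout per single-gene knockdown gives n bits in total."""
    return n

for n in (5, 7, 9):
    print(n, topology_count(n), single_knockdown_bits(n))
```

Even for n = 5 the search space has 2^25 members while the data supply only 5 bits, which is why additional information or a prior is needed.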

In this study, for the reconstruction of signal transduction pathways from the RNAi

data, a linear program based model is proposed. The major difference of our approach from

the others is that we formulate the problem so as to edit a given reference network with

respect to given RNAi data. This reference network is a PPI network in our approach.

Assume that a network, which might have already been constructed based on experiments carried out in the past, is given as a reference network. If new information about the network becomes available, e.g. new RNAi data, this network has to be updated with respect to the new data.

In our approach, from the given reference network, a new network is reconstructed that

satisfies RNAi data by applying minimum changes on the given network. The network is

represented as a graph consisting of a set of nodes and edges. First, a binary state variable is

assumed for each edge: the state variable is 1 if the edge is present in the network, otherwise

it is 0. Then, the result of each knockdown is formulated as a linear constraint after enumerating all possible paths from the source node s to the sink node t, by considering whether each path transmits the signal to the sink node. If the signal is transmitted, then at least one path is complete, and the sum of the state variables of the edges on that path must equal the number of edges on the path. Otherwise, all possible paths are incomplete, and for each path the sum of the state variables of its edges must be smaller than the number of edges on that path. The objective of this linear problem is to minimize the sum of the absolute values of the differences between the state variables of each edge in the reference network and in the new network. We solved this binary integer linear programming problem by automatically generating the corresponding LP file and solving it with CPLEX v12. The LP file is created by code written in C from the supplied data, i.e. the reference network, the number of nodes, and the knockdown data. We considered several problems with different reference networks, numbers of nodes, and knockdown data. Our experiments show that the topology of a network with 10 nodes can be constructed by our approach in a reasonable time.

This article is organized as follows. In Section 2, the model that we developed for the

reconstruction of signal transduction pathways from the RNAi data is explained in detail. In

Section 3, information about the data sets that we used to evaluate our model is given. The

results for the problems solved by using the method and a discussion on the results are given

in Section 4.

2. Methods

Consider a given directed graph G(V,E) where V represents the node set and E

represents the edge set, with a source node s and a sink node t. This graph may be taken from any protein-protein interaction (PPI) network database, where each node represents a protein. Assume that RNAi data is available from the results of RNAi experiments. Although PPI networks have undirected edges, these can be transformed to directed edges [3]. The aim

is to reconstruct a new network from the given network satisfying RNAi data by making

minimum changes on the given network. The approach would be to formulate this problem as

a linear optimization problem, which will provide a network satisfying the RNAi data with a

minimum change applied on the given network.

Let $x_{ij}$ be the binary variable representing the presence of the directed edge from node i to node j in the given network: $x_{ij}$ is 1 if the edge is present and 0 otherwise. Similarly, let $w_{ij}$ represent the edges in the network that is to satisfy the given RNAi data. We call these binary variables “the state variables”. The RNAi data states whether the signal is transferred from the source node s to the sink node t after the knockdown of a single node or of multiple nodes. The goal is to reconstruct the given network with respect to the RNAi data while minimizing the changes applied to the given reference network. The objective function of this linear problem is therefore the sum, over all edges, of the absolute differences $|x_{ij} - w_{ij}|$. If an edge is present both before and after the optimization, or absent both before and after, the difference is 0, which means no change is made on the corresponding edge in the network.

However if the difference is 1, it means that the edge is either taken out from the network or

it is inserted into the network. Therefore, minimizing the sum of these differences results in a

network that is obtained by making minimum number of changes on it while satisfying the

constraints obtained from the knockdowns.
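The absolute value in the objective does not by itself look linear. One way to see that the objective is nevertheless linear in the unknowns, offered here as a sketch rather than as the derivation used in the report, relies on the fact that each $x_{ij}$ is a known constant in {0, 1} and each $w_{ij}$ is binary:

```latex
\[
|x_{ij} - w_{ij}| =
\begin{cases}
w_{ij}, & x_{ij} = 0,\\
1 - w_{ij}, & x_{ij} = 1,
\end{cases}
\qquad\text{so}\qquad
\sum_{i,j} |x_{ij} - w_{ij}|
  = \sum_{i,j} x_{ij} + \sum_{i,j} (1 - 2x_{ij})\, w_{ij}.
\]
```

The first sum is a constant (the number of edges in the reference network), so minimizing the objective amounts to minimizing a linear function of the $w_{ij}$ with coefficient +1 for edges absent from the reference network and -1 for edges present in it.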

The result of each knockdown can be formulated as a linear constraint after

enumerating all possible paths from the source node s to the sink node t. If the signal is not observed at the sink node after the knockdown of a node (protein), then no path from source s to sink t that excludes the knocked-down node may be complete, i.e., each such path has to be broken somewhere between the source and the sink. If the signal is observed, then at least one of the possible paths not including the knocked-down node must be complete. If a path is not complete, then

at least one of the edges on the path should not be present. Therefore, the state variable

corresponding to that edge must be 0. If a path is complete, then all edges on the path must be

present and the corresponding state variables take the value 1.
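The path enumeration that these constraints rely on can be sketched as follows. The sketch assumes the complete candidate graph used in this report (no edges into the source, none out of the sink, no self-edges, no direct source-to-sink edge); the names `simple_paths` and `path_edges` are hypothetical.

```python
def simple_paths(nodes, source, sink, excluded=()):
    """Enumerate loop-free paths source -> ... -> sink that pass through at
    least one intermediate node and avoid every node in `excluded`."""
    skip = set(excluded)
    inner = [v for v in nodes if v not in (source, sink) and v not in skip]
    found = []

    def extend(path):
        for v in inner:
            if v not in path:          # no node is visited twice
                extend(path + [v])
        if len(path) > 1:              # forbid the direct source -> sink edge
            found.append(tuple(path + [sink]))

    extend([source])
    return sorted(found, key=len)

def path_edges(path):
    """Turn a node sequence into its list of directed edges (i, j)."""
    return list(zip(path, path[1:]))

nodes = ["s", "1", "2", "3", "t"]
paths_without_1 = simple_paths(nodes, "s", "t", excluded=["1"])
```

Each returned path yields one constraint: the state variables of `path_edges(path)` are the terms of the sum, and the number of edges is the bound.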

To visualize the discussion above, consider a network consisting of 5 nodes, two of

which are the source node s and the sink node t, as in (Figure 1).

Figure 1. (a) A 5-node network with all possible edges, (b) given initial network: solid lines show the connected

edges, dashed lines show the disconnected edges.

Note that, there are no edges going into the source node s and no edges coming out of the

sink node t. Also, self-edges are not allowed and there is no direct edge from source to sink.

Even with these assumptions, if we disregard the sign of an edge (whether it activates or inactivates its target node), the number of possible network topologies would be close to $2^{n \times n}$.


The topology shown includes all possible paths from the source to the sink. However, in the RNAi experiments, a node is knocked down and its edges are disabled.

Let (s-3-t) be a given path on this network which is taken from a PPI network

database, as in (Figure 1b). All other edges are inactive and it is known that this network

consists of 5 nodes. Let the knockdown data be as given in (Table 1). According to this data,

knockdown of protein 1 (node 1) causes the sink node t to be not activated. This result can be

written as mathematical constraints considering all possible paths which do not include node

1. Since no activation of the sink node is observed, none of these paths can transmit the signal, i.e. each of them must be broken at one or more edges. Such a constraint can be satisfied only by setting at least one of the state variables of the edges on each of these paths to 0. Therefore, for a non-transmitting path, the product of the state variables must be 0.

Knockdown    Effect on sink node t
Node 1       Not activated
Node 2       Activated
Node 3       Activated

Table 1. Artificial knockdown data. Source gene s is activated, while all other genes are inactive. Readout is done at gene t.

Now, all the paths which do not include an edge connected to node 1 should be determined. For our problem with the knockdown of node 1, these non-transmitting paths are (s-2-t), (s-3-t), (s-2-3-t), and (s-3-2-t). All of these paths must be broken, i.e. at least one of the $w_{ij}$ on each of these paths must be zero. This condition can be formulated for our problem as follows:

$$w_{s2} + w_{2t} \le 1, \tag{1}$$

$$w_{s3} + w_{3t} \le 1, \tag{2}$$

$$w_{s2} + w_{23} + w_{3t} \le 2, \tag{3}$$

$$w_{s3} + w_{32} + w_{2t} \le 2. \tag{4}$$

Since at least one of the state variables at the left hand sides of the inequalities is zero, the

corresponding sum must be less than the number of terms in the inequality. There is a logical

“AND” relationship between all of these constraints. They must all be satisfied at the same

time; otherwise the signal would be transmitted from the source node to the sink node.


Next, the second knockdown data is to be written as a constraint for our linear

programming problem. Knockdown of node 2 results in an activation of the sink node t. This

observation means that at least one path that does not include node 2 must be complete, i.e.

not broken, so that it is possible to transmit the signal from source to the sink. If a path

transmits the signal, then all of the state variables of the edges on that path must be 1. For our

problem with the knockdown of node 2, at least one path that does not include an edge

connected to node 2 must be present in the network. These paths are (s-1-t), (s-3-t), (s-1-3-t),

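and (s-3-1-t). For at least one of these paths, all of its $w_{ij}$ must be equal to 1. This condition can be formulated for this problem as follows:

$$w_{s1} + w_{1t} \ge 2, \ \text{or} \tag{5}$$

$$w_{s3} + w_{3t} \ge 2, \ \text{or} \tag{6}$$

$$w_{s1} + w_{13} + w_{3t} \ge 3, \ \text{or} \tag{7}$$

$$w_{s3} + w_{31} + w_{1t} \ge 3. \tag{8}$$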

Note that the inequality signs (greater than or equal to) can be replaced by equalities, since

the state variables are binary variables. Between these constraints, there is a logical “OR”

condition, i.e. at least one of them must be satisfied. Note that constraint (6) cannot be

satisfied because of the constraint stated in (2), therefore it can be omitted from the

formulation.

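Similarly, the knockdown of node 3 implies that at least one of the paths (s-1-t), (s-2-t), (s-1-2-t), and (s-2-1-t) must be present in the network. The formulation of this constraint is as follows:

$$w_{s1} + w_{1t} \ge 2, \ \text{or} \tag{9}$$

$$w_{s2} + w_{2t} \ge 2, \ \text{or} \tag{10}$$

$$w_{s1} + w_{12} + w_{2t} \ge 3, \ \text{or} \tag{11}$$

$$w_{s2} + w_{21} + w_{1t} \ge 3. \tag{12}$$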

Here, the constraint (10) cannot be satisfied because of the constraint (1), therefore it can be

excluded from the formulation.


After combining the objective function and all these constraints, the problem is stated as

follows:

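$$\text{Minimize} \quad \sum_{i,j} |x_{ij} - w_{ij}| \tag{13}$$

Subject to

$$w_{s2} + w_{2t} \le 1, \tag{14}$$

$$w_{s3} + w_{3t} \le 1, \tag{15}$$

$$w_{s2} + w_{23} + w_{3t} \le 2, \tag{16}$$

$$w_{s3} + w_{32} + w_{2t} \le 2, \tag{17}$$

$$w_{s1} + w_{1t} \ge 2, \ \text{or} \tag{18}$$

$$w_{s3} + w_{3t} \ge 2, \ \text{or} \tag{19}$$

$$w_{s1} + w_{13} + w_{3t} \ge 3, \ \text{or} \tag{20}$$

$$w_{s3} + w_{31} + w_{1t} \ge 3, \tag{21}$$

$$w_{s1} + w_{1t} \ge 2, \ \text{or} \tag{22}$$

$$w_{s2} + w_{2t} \ge 2, \ \text{or} \tag{23}$$

$$w_{s1} + w_{12} + w_{2t} \ge 3, \ \text{or} \tag{24}$$

$$w_{s2} + w_{21} + w_{1t} \ge 3. \tag{25}$$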

Now, it is necessary to convert the constraints related with “OR” (18)-(25) into linear

form. This can be done by using the conversion explained below [4]:

2.1 Either/Or constraints

Suppose that at least one of the following inequalities must hold:

$$\text{Either} \quad f_1(x) \le d_1, \tag{26}$$

$$\text{or} \quad f_2(x) \le d_2. \tag{27}$$

Using a sufficiently large positive number M, an equivalent set of constraints can be formulated as

$$f_1(x) \le d_1 + M y, \tag{28}$$

$$f_2(x) \le d_2 + M (1 - y), \tag{29}$$

where y is a binary variable. Since y must be either 1 or 0, both (28) and (29) are satisfied if

at least one of (26) and (27) is satisfied. A more general case when K out of N constraints

must hold is explained below [4].

K out of N Constraints must hold:

If there are several constraints only some of which must hold, formulation of either/or

constraint explained above can be expanded to account for such requirement. Assume that

there are N constraints and K of them must hold where K < N. The formulation is stated as

$$f_i(x) \le d_i + M y_i, \quad i = 1, \dots, N,$$

$$\sum_{i=1}^{N} y_i = N - K, \qquad y_i \in \{0, 1\}.$$

The equality in $\sum_{i=1}^{N} y_i = N - K$ can be replaced by the inequality ≤ if at least K constraints are required to be satisfied.

If $y_i = 0$, then the original constraint is obtained. However, if $y_i = 1$, then because of the large positive number M the new constraint is satisfied even if the original constraint is not. Since K out of N constraints must hold, the sum of the $y_i$ must be equal to N-K.
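The effect of the big-M conversion can be checked by brute force on a small case. The sketch below takes the four paths that avoid node 2 in the 5-node example, writes each "path is complete" condition as sum of w ≥ (number of edges) - M·y, and verifies that requiring at most N - 1 of the y values to be 1 is equivalent to the logical "OR" for every binary assignment of w. The value M = 10 and all function names are assumptions made for this illustration.

```python
from itertools import product

# Paths avoiding node 2, each given as its list of directed edges.
PATHS_KD2 = [
    [("s", "1"), ("1", "t")],
    [("s", "3"), ("3", "t")],
    [("s", "1"), ("1", "3"), ("3", "t")],
    [("s", "3"), ("3", "1"), ("1", "t")],
]

def or_holds(w, paths):
    """Logical OR: at least one path has all of its edges present."""
    return any(all(w[e] == 1 for e in p) for p in paths)

def big_m_feasible(w, paths, big_m=10):
    """True if some binary y with sum(y) <= N - 1 satisfies every relaxed row."""
    n = len(paths)
    for y in product((0, 1), repeat=n):
        if sum(y) > n - 1:   # at least K = 1 row must stay enforced
            continue
        if all(sum(w[e] for e in p) >= len(p) - big_m * yi
               for p, yi in zip(paths, y)):
            return True
    return False

def conversions_agree(paths):
    """Compare both formulations on every binary assignment of the edges."""
    edges = sorted({e for p in paths for e in p})
    for bits in product((0, 1), repeat=len(edges)):
        w = dict(zip(edges, bits))
        if or_holds(w, paths) != big_m_feasible(w, paths):
            return False
    return True
```

Any M at least as large as the longest path length works here, because a relaxed row with $y_i = 1$ then holds for every binary w.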

Now, we can reformulate the constraints (18)–(25) using the method described above

to transform “OR” conditions into “AND” conditions. The constraints (18)-(21) and (22)-(25)

then take the form

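$$w_{s1} + w_{1t} \ge 2 - M y_1, \tag{30}$$

$$w_{s3} + w_{3t} \ge 2 - M y_2, \tag{31}$$

$$w_{s1} + w_{13} + w_{3t} \ge 3 - M y_3, \tag{32}$$

$$w_{s3} + w_{31} + w_{1t} \ge 3 - M y_4, \tag{33}$$

$$y_1 + y_2 + y_3 + y_4 \le 3. \tag{34}$$

$$w_{s1} + w_{1t} \ge 2 - M y_5, \tag{35}$$

$$w_{s2} + w_{2t} \ge 2 - M y_6, \tag{36}$$

$$w_{s1} + w_{12} + w_{2t} \ge 3 - M y_7, \tag{37}$$

$$w_{s2} + w_{21} + w_{1t} \ge 3 - M y_8, \tag{38}$$

$$y_5 + y_6 + y_7 + y_8 \le 3. \tag{39}$$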

Since at least one of the constraints (30)-(33) and at least one of the constraints (35)-(38) must hold, (34) and (39) are written as inequalities, with N - K = 4 - 1 = 3. Eliminating the constraints that cannot be satisfied due to the constraints obtained from the knockdown of node 1, the problem can be stated as

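$$\text{Minimize} \quad \sum_{i,j} |x_{ij} - w_{ij}| \tag{40}$$

Subject to

$$w_{s2} + w_{2t} \le 1, \tag{41}$$

$$w_{s3} + w_{3t} \le 1, \tag{42}$$

$$w_{s2} + w_{23} + w_{3t} \le 2, \tag{43}$$

$$w_{s3} + w_{32} + w_{2t} \le 2, \tag{44}$$

$$w_{s1} + w_{1t} \ge 2 - M y_1, \tag{45}$$

$$w_{s1} + w_{13} + w_{3t} \ge 3 - M y_3, \tag{46}$$

$$w_{s3} + w_{31} + w_{1t} \ge 3 - M y_4, \tag{47}$$

$$y_1 + y_3 + y_4 \le 2. \tag{48}$$

$$w_{s1} + w_{1t} \ge 2 - M y_5, \tag{49}$$

$$w_{s1} + w_{12} + w_{2t} \ge 3 - M y_7, \tag{50}$$

$$w_{s2} + w_{21} + w_{1t} \ge 3 - M y_8, \tag{51}$$

$$y_5 + y_7 + y_8 \le 2. \tag{52}$$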

Let us return to the original problem, where the given network consists of nodes s-3-t.

The state variables for this network are $x_{s3}=1$, $x_{3t}=1$, and all the remaining are 0. The optimum solution to this problem gives 3 as the objective function value, and the state variables are found as $w_{3t}=1$, $w_{s1}=1$, $w_{1t}=1$, with the remaining equal to 0. Therefore, a total of 3 changes are applied to satisfy the given constraints: edge s-3 is removed, and edges s-1 and 1-t are added to the network. The final network structure is given in (Figure 2).

Figure 2. The resulting network satisfying the RNAi data as given in (Table 1) and therefore the constraints (41)-(52).

The solution makes sense considering the given constraints. In order to prevent signal

transmission with the knockdown of node 1, either the edge s-3 or the edge 3-t must be

removed from the network. Also, in order for the signal to be transmitted after the

knockdown of node 2 or 3, the path s-1-t must be present in the network. As it can be

understood, there may be more than one optimum solution for this problem. The solution

shown here is obtained by the software CPLEX v12.3, therefore only one of the possible

solutions is obtained. Instead of removing the edge s-3 and keeping the edge 3-t, removing

the edge 3-t and keeping the edge s-3 results in another optimum solution.
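Because the 5-node example has only 12 candidate edges, the optimum discussed above can be cross-checked without a solver by enumerating all 2^12 edge assignments. The sketch below is such a check, not the CPLEX-based procedure used in this report; it assumes that the sink is activated exactly when some complete path avoids the knocked-down node, and all names in it are hypothetical.

```python
from itertools import permutations, product

INNER = ["1", "2", "3"]
EDGES = ([("s", v) for v in INNER] + [(v, "t") for v in INNER]
         + [(a, b) for a in INNER for b in INNER if a != b])
REFERENCE = {("s", "3"), ("3", "t")}             # given path s-3-t
KNOCKDOWN = {"1": False, "2": True, "3": True}   # Table 1: sink activated?

def paths_avoiding(node):
    """Edge lists of all loop-free s -> t paths that skip `node`."""
    rest = [v for v in INNER if v != node]
    for k in range(1, len(rest) + 1):
        for mid in permutations(rest, k):
            seq = ("s",) + mid + ("t",)
            yield list(zip(seq, seq[1:]))

def consistent(present):
    """True if the edge set reproduces every knockdown readout."""
    for node, activated in KNOCKDOWN.items():
        reached = any(all(e in present for e in p) for p in paths_avoiding(node))
        if reached != activated:
            return False
    return True

def optimal_networks():
    """Return the minimum number of edits and every edge set achieving it."""
    best, winners = None, []
    for bits in product((0, 1), repeat=len(EDGES)):
        present = {e for e, b in zip(EDGES, bits) if b}
        if not consistent(present):
            continue
        cost = len(present ^ REFERENCE)   # edges added plus edges removed
        if best is None or cost < best:
            best, winners = cost, [present]
        elif cost == best:
            winners.append(present)
    return best, winners

best_cost, best_sets = optimal_networks()
```

Enumeration of this kind is only practical for very small networks; the number of assignments doubles with every candidate edge, which is why the ILP formulation is used for the 7- to 9-node cases.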

2.2 Automatic generation of constraints

In order to solve the network problem described in the previous section by using the

software CPLEX, it is necessary to generate the constraints automatically as the number of

constraints becomes very large even for pathways with small number of nodes. An LP-format

file which can be read by CPLEX is created by a code written in C. The code generates the


objective function with the input data, i.e. a given reference PPI network. The objective function consists of all the edge values $w_{ij}$ that are to be found and the values $x_{ij}$, which are the edge values of the given initial network. Then the constraints for the RNA interference data are generated. First, the constraints for the knockdowns after which the sink node t is not activated are generated. These constraints cover all possible paths from the source node s to the sink node t which do not include the knocked-down node, and there is an “AND” relation between them, as explained above. These paths have a minimum of 2 edges and a maximum of n-1 edges for a network consisting of n nodes, since a direct edge from source to sink and paths with loops are not allowed. The constraints are written to the LP file for every path length, i.e. for paths consisting of 2 edges, 3 edges, …, n-1 edges. Next, the constraints for the knockdowns after which the sink node t is still activated are generated. These constraints also cover all possible paths from the source node s to the sink node t which do not include the knocked-down node, but there is an “OR” relation between them, as explained above. These paths also have a minimum of 2 edges and a maximum of n-1 edges. Since they are related by an “OR” condition, $y_i$ and M values are added to the constraints, and the number of $y_i$ values is counted to write the additional constraints as in (48) and (52). Some of these paths have already been required to be broken by the knockdowns that do not activate the sink node; the corresponding “OR” constraints cannot be satisfied and are excluded. This is why the constraints for the knockdowns that do not activate the sink node are generated first. Lastly, the variables $w_{ij}$ and $y_i$ are declared as binary variables in the LP file.
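The generation steps described above can be sketched as follows. The report's generator is written in C and is not reproduced here; this Python version is an independent illustration. The function name `build_lp`, the variable naming scheme (`w_i_j`, `y1`, ...), the row labels, and the value M = 100 are all assumptions. The constant part of the objective (the number of reference edges) is left out of the objective line, so the reported optimum is offset by that constant.

```python
from itertools import permutations

def build_lp(inner, reference, knockdown, big_m=100):
    """Return LP-format text. `inner` lists the nodes between "s" and "t",
    `reference` is the set of edges with x_ij = 1, and `knockdown` maps each
    knocked-down node to True (sink activated) or False (not activated)."""
    def var(edge):
        return f"w_{edge[0]}_{edge[1]}"

    def paths_avoiding(node):
        rest = [v for v in inner if v != node]
        for k in range(1, len(rest) + 1):
            for mid in permutations(rest, k):
                seq = ("s",) + mid + ("t",)
                yield tuple(zip(seq, seq[1:]))

    edges = ([("s", v) for v in inner] + [(v, "t") for v in inner]
             + [(a, b) for a in inner for b in inner if a != b])
    # |x - w| is w when x = 0 and 1 - w when x = 1 (constant 1 omitted).
    objective = " ".join(("- " if e in reference else "+ ") + var(e) for e in edges)
    if objective.startswith("+ "):
        objective = objective[2:]

    rows, y_names, broken = [], [], set()
    # "AND" rows first: every path avoiding a non-activating knockdown is broken.
    for node, activated in knockdown.items():
        if activated:
            continue
        for path in paths_avoiding(node):
            if path not in broken:
                broken.add(path)
                rows.append(" + ".join(map(var, path)) + f" <= {len(path) - 1}")
    # "OR" rows: one big-M row per path that is still allowed to be complete.
    for node, activated in knockdown.items():
        if not activated:
            continue
        group = []
        for path in paths_avoiding(node):
            if path in broken:
                continue   # cannot be complete, so the row is excluded
            y = f"y{len(y_names) + 1}"
            y_names.append(y)
            group.append(y)
            rows.append(" + ".join(map(var, path)) + f" + {big_m} {y} >= {len(path)}")
        if not group:
            raise ValueError(f"no admissible path for knockdown of node {node}")
        rows.append(" + ".join(group) + f" <= {len(group) - 1}")

    lines = ["Minimize", " obj: " + objective, "Subject To"]
    lines += [f" c{k}: {row}" for k, row in enumerate(rows, 1)]
    lines += ["Binary"] + [" " + name for name in [var(e) for e in edges] + y_names]
    lines.append("End")
    return "\n".join(lines)

lp_text = build_lp(["1", "2", "3"], {("s", "3"), ("3", "t")},
                   {"1": False, "2": True, "3": True})
```

For the 5-node example this produces twelve constraint rows, matching the count of constraints in the final problem statement of Section 2.1; writing `lp_text` to a file gives an input that an LP-format reader should accept, although that has not been verified against CPLEX here.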

3. Data Sets

To evaluate the proposed formulation, we randomly generated 1,000 reference networks each with seven, eight, or nine genes, including the receptor and reporter genes. Each edge in a network is the outcome of a Bernoulli trial with probability 0.5; therefore, most of the initial PPI networks are dense. Then, for each PPI network, we generated RNAi data for each of the regular genes, i.e. the nodes between the source and the sink, with p(affecting gene = 1) = 0.5. The knockdown results are simulated by randomly assigning a value of 0 or 1 to the knockdown data of each such node. These values represent the observable results at the sink node after each knockdown. If the observable value at the sink node is 0, the knocked-down node is an affecting protein. If it is 1, the corresponding node is a non-affecting protein. From this RNAi data, the constraints for our ILP are constructed. If the knockdown result is 0, “and” constraints are created; if it is 1, “or” constraints are written.
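A minimal sketch of this data generation is given below. It assumes that the candidate edges are those allowed in Section 2 (no edges into the source, none out of the sink, no self-edges, no direct source-to-sink edge); the report does not state the candidate set used by its C generator, and the function names and the seed are hypothetical.

```python
import random

def candidate_edges(n):
    """All directed edges allowed in an n-node network with source s and sink t."""
    inner = [str(k) for k in range(1, n - 1)]
    return ([("s", v) for v in inner] + [(v, "t") for v in inner]
            + [(a, b) for a in inner for b in inner if a != b])

def random_instance(n, rng, p_edge=0.5, p_affecting=0.5):
    """Draw one reference network and one set of knockdown readouts."""
    # Each candidate edge is present with probability p_edge (Bernoulli trial).
    reference = {e for e in candidate_edges(n) if rng.random() < p_edge}
    inner = [str(k) for k in range(1, n - 1)]
    # Readout 0: sink not activated (affecting node); readout 1: activated.
    knockdown = {v: 0 if rng.random() < p_affecting else 1 for v in inner}
    return reference, knockdown

rng = random.Random(2012)   # fixed seed only to make the sketch repeatable
reference, knockdown = random_instance(7, rng)
```

Under this assumption a 7-node network has 30 candidate edges, so about 15 edges are expected in a reference network.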

4. Results and Discussion

Given a reference PPI network and a set of RNAi constraints, the corresponding integer linear program is solved by CPLEX v12 (64-bit). For each type of network, i.e. 7-, 8- and 9-node networks, six relationships are inspected: number of affecting genes vs. average solution time; number of affecting genes vs. number of changes applied on the initial network; number of edges in the initial network vs. average solution time; number of edges in the initial network vs. number of changes applied on the initial network; number of constraints vs. number of affecting genes; and difference in the number of “and” and “or” constraints vs. number of affecting genes. The number of affecting genes for an n-node network can be at most n-2, in which case all the nodes between the source node and the sink node are affecting genes. If all of these nodes are affecting nodes, we do not solve the problem, since the formulation described above is valid only when there is at least one complete path from the source node to the sink node. The presence of a complete path is guaranteed by having a non-affecting node, because at least one of the paths that do not include the non-affecting node must be complete.

First, 7-node networks are considered. The number of 7-node networks created randomly by the C code is shown in (Figure 3) with respect to the number of affecting genes. The figure shows a bell-shaped (binomial) distribution, since the networks are created randomly with a 0.5 probability of a node being an affecting node. In some of the networks, all genes (i.e. 5 nodes) are affecting nodes. In that case, the problem is not solved. Similar graphs for 8- and 9-node networks are shown in (Figure 10) and (Figure 17). (Figure 4) shows the average solution times of the ILP for different numbers of affecting genes for 7-node networks. The red bars represent the standard deviation of the solution time, which is very high. This is because the solution time depends not only on the number of affecting genes but also on many other variables, such as the number of constraints and the number of edges in the initial network. The average solution time also follows a bell-shaped curve over the number of affecting genes. This can be explained by the numbers of “and” and “or” constraints. Although the total number of constraints decreases with the increasing number of affecting genes, as shown in (Figure 8), the difference between the numbers of “and” and “or” constraints behaves in the reverse way to the average solution time, as shown in (Figure 9). If the difference between the numbers of “and” and “or” constraints is small, the average solution time is high; if the difference is large, the average solution time is low. From this observation, we can conclude that the more evenly the “and” and “or” constraints are mixed, the higher the solution time, since it becomes harder to find edge values that minimize the objective function while satisfying both the “and” and the “or” constraints.
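The shape of the distribution of affecting genes mentioned in this paragraph follows from the generation procedure: with n - 2 regular nodes, each affecting with probability 0.5, the number of affecting genes is binomially distributed. The sketch below computes the expected number of networks per count under that assumption; `expected_counts` is a hypothetical helper, and the values are expectations, not the counts observed in the report.

```python
from math import comb

def expected_counts(n_nodes, n_networks=1000, p=0.5):
    """Expected number of networks with k affecting genes, k = 0 .. n_nodes - 2."""
    m = n_nodes - 2   # regular nodes between source and sink
    return [n_networks * comb(m, k) * p ** k * (1 - p) ** (m - k)
            for k in range(m + 1)]

counts_7 = expected_counts(7)
```

For 7-node networks the expected counts peak at 2 and 3 affecting genes, and about 3% of the networks are expected to have all 5 regular nodes affecting, which is the case that is not solved.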

In (Figure 5), the number of changes made on the initial network is shown with respect to the number of affecting genes. As the number of affecting genes increases, the number of changes made on the initial network also increases. This is due to the increase in the number of “and” constraints with the number of affecting genes. Satisfying all the “and” constraints is more difficult than satisfying the “or” constraints, because satisfying only one “or” constraint of a group is enough, whereas all the “and” constraints must be satisfied at the same time. Therefore, more changes are made on the initial network as the number of affecting genes increases.

As mentioned before, the average solution time may also depend on the number of edges in the initial network. This dependence is shown in (Figure 6). Here, the number of edges in the initial networks ranges from 7 to 23, because these networks are created randomly and each edge has a probability of 0.5 of being present in the initial network. (Figure 6) shows that the average solution time increases slightly with the number of edges in the initial network. This is expected: as the number of edges in the initial network increases, more edges must be removed from the network to satisfy the constraints, and therefore the time required to find such edges increases. Consequently, as shown in (Figure 7), the number of changes made on the initial network also increases with the number of edges in the initial network. In these two figures, while the average solution time varies widely, the standard deviation of the number of changes in the initial network is smaller and proportional to the number of edges in the initial network.
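
The spread of edge counts follows from the random construction: if each of m candidate edges is kept independently with probability 0.5, the edge count is binomially distributed with mean m/2 and standard deviation sqrt(m)/2. The sketch below simulates this. The number of candidate edges (`n_candidates=30`), the function name, and the seed are assumptions chosen for illustration; the actual set of candidate edges is determined by the network model of this report.

```python
import random
import statistics


def sample_edge_counts(n_candidates, n_networks, p=0.5, seed=0):
    """Draw the edge count of n_networks random networks.

    Each of the n_candidates possible edges is kept independently with
    probability p. n_candidates is a hypothetical parameter here.
    """
    rng = random.Random(seed)
    return [
        sum(rng.random() < p for _ in range(n_candidates))
        for _ in range(n_networks)
    ]


counts = sample_edge_counts(n_candidates=30, n_networks=1000)
print("min/max edges:", min(counts), max(counts))
print("mean:", statistics.mean(counts))
print("stdev:", statistics.stdev(counts))
# Theory for comparison: mean = 30 * 0.5 = 15, stdev = sqrt(30 * 0.25), about 2.74
```

Because extreme edge counts are rare under this distribution, averages computed for networks at either end of the edge range rest on few samples, which is worth keeping in mind when reading the per-edge-count plots.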

The above discussion is also valid for the 8-node networks, (Figure 10) to (Figure 16), and the 9-node networks, (Figure 17) to (Figure 23). The created networks have a normal distribution. As the number of nodes increases, the total number of constraints increases exponentially, which in turn results in an exponential increase in the solution time. Similarly, the number of edges in the initial network and the number of changes imposed on the initial network increase with the number of nodes in the network.

Figure 3. Number of 7-node networks that are created randomly for different number of affecting genes.
[Plot: x-axis “# of Affecting Genes” (0 to 5); y-axis “# of Networks Created” (0 to 350).]

Figure 4. Average solution times for the 7-node networks with different number of affecting genes.
[Plot: x-axis “# of Affecting Genes” (0 to 4); y-axis “Average Solution Time” (0 to 0.14); series: avg time, stdev.]

Figure 5. Average number of changes applied on the reference 7-node networks with different number of affecting genes.
[Plot: x-axis “# of Affecting Genes” (0 to 4); y-axis “# of Changes in Initial Network” (0 to 7); series: # of chg, stdev.]

Figure 6. Average solution times for the 7-node networks with different number of edges in the reference networks.
[Plot: x-axis “# of Edges in Initial Network” (7 to 23); y-axis “Average Solution Time” (0 to 0.25); series: avg time, std dev.]

Figure 7. Average number of changes applied on the reference 7-node networks with different number of edges in the reference networks.
[Plot: x-axis “# of Edges in Initial Network” (7 to 23); y-axis “# of Changes in Initial Network” (0 to 8); series: # of chg, std dev.]

Figure 8. Number of AND and OR constraints for 7-node networks with different number of affecting genes.
[Plot: x-axis “# of Affecting Genes” (0 to 4); y-axis “# of Constraints” (0 to 350); series: # of or cons, # of and cons.]

Figure 9. Difference between AND and OR constraints for 7-node networks with different number of affecting genes.
[Plot: x-axis “# of Affecting Genes” (0 to 4); y-axis “Difference Between Constraints” (0 to 350).]

Figure 10. Number of 8-node networks that are created randomly for different number of affecting genes.
[Plot: x-axis “# of Affecting Genes” (0 to 6); y-axis “# of Networks” (0 to 350).]

Figure 11. Average solution times for the 8-node networks with different number of affecting genes.
[Plot: x-axis “# of Affecting Genes” (0 to 5); y-axis “Average Solution Time” (0 to 0.5); series: avg time, stdev.]

Figure 12. Average number of changes applied on the reference 8-node networks with different number of affecting genes.
[Plot: x-axis “# of Affecting Genes” (0 to 5); y-axis “# of Changes in Initial Network” (0 to 9); series: chg in soln, stdev.]

Figure 13. Average solution times for the 8-node networks with different number of edges in the reference networks.
[Plot: x-axis “# of Edges in Initial Network” (11 to 31); y-axis “Average Solution Time” (0 to 0.6); series: avg time, std dev.]

Figure 14. Average number of changes applied on the reference 8-node networks with different number of edges in the reference networks.
[Plot: x-axis “# of Edges in Initial Network” (11 to 31); y-axis “# of Changes in Initial Network” (0 to 12); series: # of chg, std dev.]

Figure 15. Number of AND and OR constraints for 8-node networks with different number of affecting genes.
[Plot: x-axis “# of Affecting Genes” (0 to 5); y-axis “# of Constraints” (0 to 2500); series: # of or cons, # of and cons.]

Figure 16. Difference between AND and OR constraints for 8-node networks with different number of affecting genes.
[Plot: x-axis “# of Affecting Genes” (0 to 5); y-axis “Difference Between Constraints” (0 to 2500).]

Figure 17. Number of 9-node networks that are created randomly for different number of affecting genes.
[Plot: x-axis “# of Affecting Genes” (0 to 7); y-axis “# of Networks” (0 to 300).]

Figure 18. Average solution times for the 9-node networks with different number of affecting genes.
[Plot: x-axis “# of Affecting Genes” (0 to 6); y-axis “Average Solution Time” (0 to 50); series: avg time, std dev.]

Figure 19. Average number of changes applied on the reference 9-node networks with different number of affecting genes.
[Plot: x-axis “# of Affecting Genes” (0 to 6); y-axis “# of Changes in Initial Network” (0 to 12); series: # of chg, std dev.]

Figure 20. Average solution times for the 9-node networks with different number of edges in the reference networks.
[Plot: x-axis “# of Edges in Initial Network” (18 to 41); y-axis “Average Solution Time” (0 to 100); series: avg time, std dev.]

Figure 21. Average number of changes applied on the reference 9-node networks with different number of edges in the reference networks.
[Plot: x-axis “# of Edges in Initial Network” (18 to 41); y-axis “# of Changes in Initial Network” (0 to 16); series: # of chg, std dev.]

Figure 22. Number of AND and OR constraints for 9-node networks with different number of affecting genes.
[Plot: x-axis “# of Affecting Genes” (0 to 6); y-axis “# of Constraints” (0 to 16000); series: # or cons, # and cons.]

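Figure 23. Difference between AND and OR constraints for 9-node networks with different number of affecting genes.
[Plot: x-axis “# of Affecting Genes” (0 to 6); y-axis “Difference Between Constraints” (0 to 16000).]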

5. Acknowledgement

The support of the European Union in the 7th Framework Programme through SysPatho, grant 260429, is greatly appreciated.

6. References

[1] A.L. Brass, D.M. Dykxhoorn, Y. Benita, N. Yan, A. Engelman, R.J. Xavier, J. Lieberman and S.J. Elledge, Identification of host proteins required for HIV infection through a functional genomic screen, Science 319 (2008), pp. 817–824.

[2] A. Fire, S. Xu, M.K. Montgomery, S.A. Kostas, S.E. Driver and C.C. Mello, Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans, Nature 391 (1998), pp. 806–811.

[3] A. Gitter, J. Klein-Seetharaman, A. Gupta and Z. Bar-Joseph, Discovering pathways by orienting edges in protein interaction networks, Nucl. Acids Res. 39(4) (2011).

[4] F.S. Hillier and G.J. Lieberman, Introduction to Operations Research, McGraw-Hill, New York, NY, 2001.

[5] L. Kaderali and N. Radde, Inferring gene regulatory networks from expression data, Computational Intelligence in Bioinformatics (2008), pp. 33–74.

[6] L. Kaderali, E. Dazert, U. Zeuge, M. Frese and R. Bartenschlager, Reconstructing signaling pathways from RNAi data using probabilistic Boolean threshold network, Bioinformatics 25(17) (2009), pp. 2229–2235.

[7] R. König, Y. Zhou, D. Elleder, T.L. Diamond, G.M.C. Bonamy, J.T. Irelan, C. Chiang, B.P. Tu, P.D. De Jesus, C.E. Lilley, S. Seidel, A.M. Opaluch, J.S. Caldwell, M.D. Weitzman, K.L. Kuhen, S. Bandyopadhyay, T. Ideker, A.P. Orth, L.J. Miraglia, F.D. Bushman, J.A. Young and S.K. Chanda, Global analysis of host-pathogen interactions that regulate early-stage HIV-1 replication, Cell 135 (2008), pp. 49–60.

[8] F. Markowetz, D. Kostka, O.G. Troyanskaya and R. Spang, Nested effects models for high-dimensional phenotyping screens, Bioinformatics 23 (2007), pp. 305–312.

[9] J. Moffat and D.M. Sabatini, Building mammalian signaling pathways with RNAi screens, Nat. Rev. Mol. Cell Biol. 7 (2006), pp. 177–187.

[10] R. Sacher, L. Stergiou and L. Pelkmans, Lessons from genetics: interpreting complex phenotypes in RNAi screens, Current Opinion in Cell Biology 20 (2008), pp. 483–489.

[11] J. Scott, T. Ideker, R. Karp and R. Sharan, Efficient algorithms for detecting signaling pathways in protein interaction networks, J. Comput. Biol. 13 (2006), pp. 133–144.

[12] M. Steffen, A. Petti, J. Aach, P. D'haeseleer and G. Church, Automated modelling of signal transduction networks, BMC Bioinformatics 3:34 (2002).