computational biology methods for drug discovery_phase 1-5_november 2015

COMPUTATIONAL BIOLOGY METHODS FOR DRUG DISCOVERY (CBDD) PHASE 1-5

Marina Bessarabova, Ph.D. Director, Computational Biology & Bioinformatics [email protected]

Alex Ishkin, Ph.D. Senior Science Analyst [email protected]

September, 2015



EXECUTIVE SUMMARY

Systems biology has become a powerful approach for drug discovery and development, bringing together OMICs data, structured “knowledge” databases of networks and pathways, and tools for network analysis of OMICs data.

Significant effort has been invested in the production and annotation of OMICs data and collection of biological information focusing on networks and pathways. These have been made available for research purposes both in the public domain and from proprietary sources.

However, OMICs data and network databases are only as useful as the analytical methods available to analyze them.

SOLUTION: COMPUTATIONAL BIOLOGY METHODS FOR DRUG DISCOVERY

Thomson Reuters (TR), one of the leading providers of systems biology tools (such as MetaCore™ and MetaBase™), has launched a “Computational Biology Methods for Drug Discovery” (CBDD) program focused on implementing advanced, state-of-the-art approaches for network and pathway analysis of OMICs data.

WITH THE CBDD PROGRAM

• Gain access to the best systems biology approaches – Many of the most important methods developed for network analysis of OMICs data over the last 10 years have been implemented by the TR Systems Biology Team.

• Get working tools – Algorithms are implemented in convenient, well-supported packages accessible from R, which can be applied directly to networks and pathways represented in standard formats.

• Maximize the utility of internal and external network/pathway information resources – Increase the value of OMICs data analysis by using a range of network methods.

• Free up analysts' time – Analysts can work on real analysis projects that deliver information on new drug targets, biomarker identification, patient stratification, etc., rather than on tool development.


CBDD MEMBERSHIP SCOPE

OVERVIEW

The CBDD toolkit is designed to enable seamless analysis of diverse data sets within the context of any networks available to the user. A typical design of a computational workflow in systems biology looks like the following:

1. Gather data to analyze;

2. Collect an appropriate molecular network;

3. Select the algorithm intended to solve a particular problem.

The molecular data for analysis are typically readily available (either from the public domain or generated in-house) and have a relatively simple structure. CBDD helps with the two more complex steps, namely network generation and algorithm selection.

The network generation is a more challenging area. There are two major approaches to create large-scale networks describing relationships in biological systems:

• Scaffold network: experimental detection of physical interactions between genes, proteins, and other molecules, or curation of such findings from literature.

• Data-driven network: de novo prediction of relationships between molecules based on pattern mining in diverse data sets (for example, a search for expression similarities, which typically indicate some functional relationship between genes, or text mining for co-citations of genes in compendia of literature).

Both approaches have their advantages. The CBDD will provide infrastructure to load pre-existing networks from elsewhere and apply specific algorithms for data-driven network generation.

The selection of an appropriate network analysis algorithm is a challenge. In the last few years, dozens of various network and pathway analysis approaches have been published in the peer-reviewed literature. However, using them requires understanding their required inputs (data and network), assumptions they make about the biology, and the goals that they intend to achieve. Furthermore, each new algorithm is a new learning curve, and implementations are not always readily available.

The goal of CBDD is, first, to select state-of-the-art algorithms that could benefit drug development research and, second, to implement them in a uniform fashion, providing a robust and easy-to-use software package.


DATA AND NETWORK FORMATS

There are several major input formats for CBDD algorithms (outlined in Figure 1). The algorithms differ in their input requirements (some require inputs to have specific properties, while some are very general).

The OMICs data formats are:

• Start nodes: a list of genes somehow associated with the phenotype of interest. The start nodes may have associated activity/abundance change values (e.g., expression fold changes) and also p-values showing significance of their phenotype association.

• Matrix: a matrix of whole-genome measurements, e.g., gene expression signals, typically accompanied by a vector of sample phenotypes to facilitate intergroup comparison.

The basic network format is simply an adjacency list: a two-column matrix showing which nodes interact. Network edges may have additional attributes:

• Direction. Some algorithms require a directed network and assume that a signal goes from node 1 to node 2.

• Edge type (effect). Some algorithms use information on whether interactions result in the activation or inhibition of their target nodes.

• Mechanism. Some algorithms may differentiate between different types of interactions (e.g., physical binding vs. transcription regulation links).

• Weight. Edges may be weighted by confidence.

Pathways are small subnetworks, typically much better studied than the rest of the network. The network attributes described above apply equally to pathways.

Figure 1. CBDD input charts


ALGORITHM AREAS

The network analysis algorithms do different things and may be used for different purposes. It is often hard to classify the algorithms into precisely defined categories. CBDD uses a classification based on the end goal (roughly corresponding to typical research needs in drug development):

AREA | DESCRIPTION | EXAMPLE PURPOSE
Data-driven network | De novo generation of relationships between biological entities (e.g., from similarities in gene expression profiles). | Any purpose
Edge prioritization | Weighting and adjustment of networks, making them more specific to a particular biological context. | Any purpose
Node prioritization | Learn which nodes in the network are well connected to the nodes of interest and might regulate the phenotype. | Drug target identification
Subnetwork prioritization | Find modules in networks that are associated with phenotype. | Mechanism reconstruction; biomarker discovery
Pathway prioritization | Learning which of the canonical signaling pathways are associated with phenotype provides good clues to the molecular mechanisms behind it and may help with biomarker search. | Mechanism reconstruction; biomarker discovery
Unsupervised analysis | Learn how patients stratify into subtypes and which networks and pathways drive this stratification. | Patient stratification
Integrative analysis | Many OMICs data types are routinely available, and there is often a need to understand how they talk to one another. Networks can be utilized for answering questions such as “which mutations in my data set affect the differential expression in disease?” | Any purpose
Network comparison | Compare mechanisms underlying different diseases or disease models. | Mechanism reconstruction


IMPLEMENTATION

The priorities in the development of CBDD are as follows:

• Convenience. CBDD is implemented in the conventional form of an R package, making it accessible to a wide range of bioinformaticians without extensive computer science training. Additional functions for importing networks from different sources and for exploring analysis results will be added.

• Generality. CBDD is focused on implementing a wide range of general-purpose tools rather than algorithms tied to a specific experimental design or data type.

• Reliability. Algorithms are extensively tested; detailed descriptions of the steps, with toy inputs and outputs, are provided.

• Performance. Computationally intensive parts of the algorithms are implemented in Java behind the scenes for improved overall performance. The aim is to keep the runtime low even for computationally complex algorithms applied to biologically relevant input data sizes, while still keeping as much code in R as possible.

• Modularity. The algorithms are often modular in nature [1], and users might want to tweak the modules. CBDD encourages this by modularizing the algorithms as much as possible, providing a separate R function for each step, and encouraging users to reuse these parts and build custom analysis workflows.

NETWORK GENERATION

SCAFFOLD NETWORK

A scaffold network is a global interaction network that contains interactions detected in high- and low-throughput experiments. These interactions may be detected in various tissues and conditions. The scaffold network is therefore a collection of interactions occurring in a variety of contexts.

Scaffold Network from MetaBase. The MetaBase resource is an example of a high-quality scaffold interaction network. MetaBase contains manually curated interactions obtained from small-scale experiments as well as interactions from manually extracted linear pathways. In MetaBase, molecules are defined as network objects, which describe the type of gene product, such as kinases and receptors. Furthermore, network objects may correspond to groups of molecules, including protein complexes and protein families. Access to the MetaBase network is licensed; the license is not included in the CBDD scope.

Scaffold Network from Public Sources. As a part of the CBDD scope, we will collect and deliver interactions from the following public sources:

• BioGRID [2], an interaction repository that contains manually curated protein and gene interactions compiled from literature search.

• IntAct [3], an open source database of molecular interaction data retrieved from experimental studies, mostly based on yeast two-hybrid screens. All interactions contained in IntAct are derived from literature curation or directly deposited by users.

• HumanNet [4], a probabilistic functional network of 18,714 protein-encoding human genes. It is constructed by Bayesian integration of 21 types of ‘OMICs’ data from multiple organisms. Each interaction in HumanNet has an associated log-likelihood score (LLS) that measures the probability of an interaction representing a true functional linkage between two genes.

• PIPs [5], a database of predicted human protein-protein interactions. The predictions have been made using a naïve Bayesian classifier to calculate a probability score for each interaction. The interaction probability between two proteins is calculated by combining different features including co-expression, orthology, co-localization, domain co-occurrence, PTMs, and network analysis.

• Hippie [6], a database of human protein-protein interactions compiled from experimental evidence. Each interaction is assigned a confidence score based on the amount and reliability of the evidence supporting it. This score is calculated as a weighted sum of the number of studies in which an interaction was detected, the number and quality of experimental techniques used to measure an interaction, and the number of nonhuman organisms in which an interaction was reproduced.

We will combine interactions annotated for Homo sapiens and reannotate the diverse identifiers used throughout the databases to the Entrez Gene IDs. The combined network will be updated with each phase of CBDD delivery. The current number of edges corresponds to the Phase 2 update, accomplished in October 2014.

Note: The resulting network will include only undirected edges without attributes. It therefore cannot be used with CBDD methods that require a directed network, such as Causal Reasoning or Hidden Nodes.

SOURCE | INTERACTIONS
BioGRID | 142,811*
IntAct | 88,975*
HumanNet | 476,399
Hippie | 160,925
PIPs | 91,050
Total (Union) | 713,780
Human genes covered | 19,097

*Only interactions between human proteins are included in these statistics.

SCAFFOLD NETWORK FROM STRING DATABASE

The STRING database [7,8] is a widely known resource on interactions. The interaction evidence is based on the following criteria:

• High-throughput experimental evidence

• Text mining and literature curation

• Genomic context (gene fusions, etc.)

• Co-expression

Each predicted association is assigned a confidence score, derived by benchmarking the performance of the predictions against a common reference set of trusted, true associations.

We will provide full STRING interaction data sets for human, mouse, and rat.

PUBLIC NETWORK DELIVERY FORMATS

1. Format 1. Tab-delimited file(s); or

2. Format 2. Integration of public interaction content into MetaBase.

Public interaction content can be integrated into MetaBase, allowing use of the existing database infrastructure and leveraging the integrity of MetaBase content. This integration will be delivered as an extension of MetaBase (additional schemas for custom content and for automatic content integration) together with ETL procedures to upload custom interactions into it. The infrastructure defines a unifying interaction format compatible with the MetaBase data model and provides an integration report. This is offered as a one-time upload of public interaction content at the start of the CBDD Core project. Any subsequent updates of the public content in MetaBase are beyond the scope of this project (please see the “Project Fee” section for more detail).

Algorithms implemented in CBDD will support both delivery formats equally.


EDGE PRIORITIZATION: ADDING CONTEXT-SPECIFICITY TO THE SCAFFOLD NETWORK

The scaffold network contains all interactions that have been experimentally detected regardless of the context of the interaction. Filtering the interactions based on their context leads to a molecular network that is specific to a certain condition.

ADDING CONTEXT-SPECIFICITY BY NODE FILTERING

Gene expression data can be used to filter the scaffold network based on gene presence and absence in a given condition. The context-specific network is restricted to genes that are detected as present in the expression data and interactions between present genes. Genes not measured in an expression data-set can either be removed from or retained in the network.

ADDING CONTEXT-SPECIFICITY BY EDGE REWEIGHTING

Node filtering can significantly alter the topology of the scaffold network and have a large impact on network-based computational approaches. Because expression data are noisy, genes and interactions may be removed wrongly, introducing errors into the network structure. Therefore, instead of removing nodes from the scaffold network, the interactions can be reweighted based on gene presence and absence in the gene expression data. An interaction between two present genes should be assigned a high weight, while all other interactions should be down-weighted.

MISSING NODE VALUES

By default, all network nodes without measured values are treated as “unknown” and left in place:

1. In the node-filtered network, we remove only nodes that are explicitly measured as absent.

2. In the edge-reweighted network, we down-weight only edges that involve absent nodes.

In our experience, treating nodes without measurements as absent works well, but it might disrupt the network by excluding nodes that could not be measured (for example, compounds in the MetaBase network). Therefore, the default option is to keep unknown nodes; options to remove them will also be provided.
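The two context-specific strategies and the handling of unmeasured nodes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `keep_unknown` flag, and the weights 1.0 and 0.1 are hypothetical and are not part of the CBDD package.

```python
def node_filter(edges, present, keep_unknown=True):
    """Drop every edge that touches a gene measured as absent.

    edges:   list of (gene_a, gene_b) pairs from the scaffold network.
    present: dict gene -> True (present) or False (absent); genes missing
             from the dict are unmeasured ("unknown").
    """
    def keep(gene):
        # Unknown genes follow the keep_unknown option (assumed name).
        return present.get(gene, keep_unknown)

    return [(a, b) for a, b in edges if keep(a) and keep(b)]


def edge_reweight(edges, present, high=1.0, low=0.1, keep_unknown=True):
    """Keep all edges, but down-weight those involving an absent gene."""
    def ok(gene):
        return present.get(gene, keep_unknown)

    # The weights `high` and `low` are illustrative values only.
    return {(a, b): (high if ok(a) and ok(b) else low) for a, b in edges}


edges = [("TP53", "MDM2"), ("MDM2", "AKT1"), ("MDM2", "CPD1")]
present = {"TP53": True, "MDM2": True, "AKT1": False}  # CPD1 is unmeasured

filtered = node_filter(edges, present)
strict = node_filter(edges, present, keep_unknown=False)
weights = edge_reweight(edges, present)
```

With the default setting, the edge to the unmeasured node `CPD1` survives filtering; with `keep_unknown=False` it is removed together with the edge to the absent gene.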

DATA-DRIVEN NETWORKS

Data-driven network reconstruction aims at identifying regulatory interactions between genes solely based on experimental data. Data-driven networks can be based on large-scale microarray experiments that allow for extracting correlated gene expression patterns [9–12]. The idea behind data-driven network reconstruction is to accurately model the interactions occurring in a particular context or condition instead of using a scaffold network that contains interactions observed in many different conditions.

However, data-driven network reconstruction is often computationally expensive and requires large-scale experimental data. Furthermore, the resulting networks may contain a mixture of physical and indirect interactions.

CORRELATION-BASED NETWORK

Data-driven networks can be reconstructed from large-scale experimental data-sets. A computationally inexpensive approach is to compute the expression correlation between all pairs of genes. Genes with similar expression profiles in a multitude of expression samples can be regarded as functionally related and potentially directly interacting. Therefore, by applying a pre-defined correlation threshold, gene expression correlations can be translated into interactions in the correlation-based network.
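A minimal sketch of this thresholding idea, assuming Pearson correlation and an illustrative threshold of 0.8 (neither choice is prescribed by the text; the function name is hypothetical):

```python
import numpy as np


def correlation_network(expr, genes, threshold=0.8):
    """Build an edge list from a genes x samples expression matrix.

    Two genes are connected when the Pearson correlation of their
    expression profiles is at least `threshold`.
    """
    corr = np.corrcoef(expr)  # pairwise correlation between rows (genes)
    edges = []
    for i in range(len(genes)):
        for j in range(i + 1, len(genes)):
            if corr[i, j] >= threshold:
                edges.append((genes[i], genes[j], float(corr[i, j])))
    return edges


genes = ["G1", "G2", "G3"]
expr = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.1, 3.9, 6.2, 8.0, 9.9],  # closely tracks G1
    [5.0, 1.0, 4.0, 2.0, 3.0],  # unrelated pattern
])
edges = correlation_network(expr, genes)
```

Here only the G1–G2 pair passes the threshold. On genome-scale data the all-pairs loop would normally be replaced by a vectorized selection on the correlation matrix.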


COMBINING SCAFFOLD AND DATA-DRIVEN NETWORKS

Scaffold and data-driven networks each have advantages and disadvantages. An advantage of scaffold networks is the quality of the interactions when they are manually curated from low-throughput experiments, as in the MetaBase resource [13]. However, scaffold networks do not reflect all interactions that are present in a particular context. Filtering network objects or reweighting interactions based on gene expression profiles in a particular condition adds important information on context-specificity to the network. Nevertheless, adding expression-based constraints to a scaffold network can only reduce the number of false positives present in the network. False negatives, i.e., interactions that have not yet been discovered and are therefore not included in the scaffold network, cannot be identified.

Data-driven approaches, on the other hand, rely solely on experimental data available for a condition of interest. While they aim at inferring all functional relationships between gene pairs, they cannot distinguish between physical and indirect interactions and may thus include too many interactions in a network. Therefore, integrating scaffold and data-driven interactions is a logical step forward in the generation of molecular interaction networks, as it enriches the network of physical interactions with potentially unknown interactions that are present in a specific condition only.

Scaffold Interactions Enriched with Data-Driven Interactions

Scaffold networks, while often manually curated, may contain a large number of false negatives, i.e., interactions that have not been validated to date. Adding high-confidence data-driven interactions to the scaffold network can thus improve the overall coverage of existing interactions. Gene pairs exhibiting highly correlated expression patterns across a multitude of conditions indicate potential physical interactions between them. Therefore, a predefined number of data-driven edges per gene pair should be integrated into the scaffold network.

Context-Specific Scaffold Enriched with Data-Driven Interactions

While a context-specific scaffold network highlights a subset of interactions that are present in a particular condition, it cannot infer unknown interactions. Therefore, data-driven interactions can be added to the context-specific scaffold network based on the gene expression correlation in the experimental data. The idea is to add a limited number of data-driven edges for each pair of genes whose expression correlation is above a predefined threshold. Highly correlated expression patterns indicate that the genes are closely related in function and are thus likely to physically interact.

ARACNE

ARACNe [14] infers interactions from gene expression data sets. It first identifies co-regulated gene pairs of high statistical significance by using mutual information (MI), and then prunes possible indirect links. ARACNe can reconstruct a hierarchical and scale-free network and performs well in comparative evaluations.

Co-regulation is captured using MI, an information-theoretic quantity capable of finding nonlinear relationships between variables. MI is computed from the entropies of the gene expression profiles (GEPs) and, for a pair of random variables $X_i$ and $X_j$, is defined as:

$$I_{i,j} = S(X_i) + S(X_j) - S(X_i, X_j),$$

where $S(t)$ is the entropy of $t$, defined as $S(t) = -\sum_i p(t_i) \log p(t_i)$ for a discrete variable $t$.

After the interactions with significant MI have been found, indirect edges are pruned using the data processing inequality (DPI). The DPI states that if genes $g_1$ and $g_3$ interact only through a third gene, $g_2$ (i.e., if the interaction network is $g_1 - g_2 - g_3$ and no alternative path exists between $g_1$ and $g_3$), then

$$I(g_1, g_3) \le \min\big(I(g_1, g_2),\, I(g_2, g_3)\big).$$

Correspondingly, ARACNe starts with a network graph where each $I_{ij} > I_0$ is represented by an edge $(i, j)$. The algorithm then examines each gene triplet for which all three MIs are greater than $I_0$ and removes the edge with the smallest value.
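The two stages (MI thresholding, then DPI pruning) can be illustrated with a simplified Python sketch. It is not the published ARACNe implementation: MI is estimated here with a plain 3-bin histogram, no DPI tolerance is applied, and the function names and the threshold value are assumptions made for the example.

```python
import itertools

import numpy as np


def mutual_information(x, y, bins=3):
    """Plug-in MI estimate (in nats) from a 2-D histogram of two profiles."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)  # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)  # marginal of y, shape (1, bins)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())


def aracne_sketch(expr, genes, mi_threshold):
    """Keep edges with MI > mi_threshold, then apply DPI to full triplets."""
    n = len(genes)
    mi = {
        (i, j): mutual_information(expr[i], expr[j])
        for i, j in itertools.combinations(range(n), 2)
    }
    edges = {e for e, value in mi.items() if value > mi_threshold}

    # For every triplet whose three edges all passed the threshold,
    # mark the weakest edge as a presumed indirect interaction.
    to_remove = set()
    for i, j, k in itertools.combinations(range(n), 3):
        triplet = [(i, j), (i, k), (j, k)]
        if all(e in edges for e in triplet):
            to_remove.add(min(triplet, key=lambda e: mi[e]))

    return [(genes[i], genes[j], mi[(i, j)]) for i, j in sorted(edges - to_remove)]


# Toy chain g1 -> g2 -> g3 with added noise.
rng = np.random.default_rng(0)
g1 = rng.normal(size=200)
g2 = g1 + 0.3 * rng.normal(size=200)
g3 = g2 + 0.3 * rng.normal(size=200)
expr = np.vstack([g1, g2, g3])
result = aracne_sketch(expr, ["g1", "g2", "g3"], mi_threshold=0.05)
```

Edges are marked for removal against the original thresholded graph, so the outcome does not depend on the order in which triplets are visited.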


KUFFNER ET AL. NETWORK RECONSTRUCTION

This approach [15] to network inference was one of the top performers in the DREAM5 network reconstruction challenge [16]. It is based on the assumption that transcription factors (TFs) and their corresponding target genes (TGs) exhibit mutual expression dependencies in at least a subset of the measured experimental conditions (time points, perturbations, etc.). Candidate interactions between TFs and TGs are ranked by a score s. The score can be any measure of dependency between expression profiles (e.g., correlation or MI).

Candidate interactions are evaluated using $\eta^2$, a nonparametric, nonlinear correlation coefficient obtained from a two-way analysis of variance (ANOVA). It is fast and does not require discretization of the input data.

This method was used for predicting transcription regulation links in the context of the DREAM5 challenge, but it can be generalized. As only ranking was needed, no significance testing for edges was applied, but that can be added as well.

NETWORK ANALYSIS APPROACHES

NODE PRIORITIZATION (LOCAL TOPOLOGY)

Local methods make use of the network neighborhood of disease-associated genes (e.g., derived from OMICs data) to prioritize novel candidates.

NEIGHBORHOOD SCORING

Neighborhood Scoring prioritizes candidates based on the distribution of differentially expressed genes in the network [17]. We adapted the method such that every network object is assigned a score, which is based partly on its expression fold change and partly on the expression fold changes of its neighbors. First, the differential expression levels of the genes are mapped to the corresponding network objects. Next, an adjusted differential expression level, the score, is calculated for each network object as follows:

The score of network object i depends on its fold change (FC) and on the fold changes of its neighbors n, where N(i) includes all neighboring network objects of i. The importance of the two terms is weighted by the weighting factor α. Network objects that are not differentially expressed and that do not have any differentially expressed genes in their direct neighborhood are assigned a score of 0.
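A minimal Python sketch of such a score, assuming the form score(i) = α·FC(i) + (1 − α)·mean of FC(n) over n in N(i). Averaging over the neighbors and the default α = 0.5 are assumptions made for this illustration, as is the function name.

```python
def neighborhood_scores(neighbors, fold_change, alpha=0.5):
    """Blend each node's own fold change with that of its neighbors.

    neighbors:   dict node -> list of neighboring nodes, N(i).
    fold_change: dict node -> fold change; nodes not listed are treated
                 as not differentially expressed (fold change 0).
    """
    scores = {}
    for node, nbrs in neighbors.items():
        own = fold_change.get(node, 0.0)
        if nbrs:
            nbr_mean = sum(fold_change.get(n, 0.0) for n in nbrs) / len(nbrs)
        else:
            nbr_mean = 0.0
        scores[node] = alpha * own + (1 - alpha) * nbr_mean
    return scores


neighbors = {"A": ["B", "C"], "B": ["A"], "C": ["A"], "D": []}
fold_change = {"A": 2.0, "B": 1.0}
scores = neighborhood_scores(neighbors, fold_change)
```

Node `C` is not differentially expressed itself but still receives a nonzero score through its neighbor `A`, while the isolated node `D` scores 0.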

GUILT BY ASSOCIATION

Guilt by association methods assume that genes that are in close network proximity to a known disease gene are likely to be involved in the disease as well. Candidates can be prioritized by their network neighborhood of known disease genes [18]:

Each candidate is scored based on the number of disease-associated genes in the network neighborhood, DN(i), compared to the total number of neighbors N(i).
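Read as a ratio, this score is |DN(i)| / |N(i)|. The sketch below assumes that reading; the function name and the choice of 0 for nodes without neighbors are illustrative.

```python
def guilt_by_association(neighbors, disease_genes):
    """Fraction of each candidate's neighbors that are known disease genes."""
    disease = set(disease_genes)
    return {
        node: (len(disease & set(nbrs)) / len(nbrs) if nbrs else 0.0)
        for node, nbrs in neighbors.items()
    }


neighbors = {
    "X": ["D1", "D2", "G1", "G2"],
    "Y": ["D1"],
    "Z": [],
}
gba = guilt_by_association(neighbors, {"D1", "D2"})
```

A candidate with a single disease neighbor (`Y`) outranks one with two disease genes among four neighbors (`X`), which shows how strongly the ratio favors low-degree nodes.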


INTERCONNECTIVITY

Interconnectivity prioritizes candidates based on their overall connectivity to the differentially expressed genes [19]. First, an interconnectivity score is calculated for each pair of interacting network objects. The interconnectivity score is based on both the direct interaction between a pair and the indirect interactions with a path length of two, which we define as the shared neighborhood of two network objects. We adapted the method to score interactors of differentially expressed genes based on their direct interaction and on their shared neighborhood as follows:

e(i,j) describes an edge between the two network objects i and j. It is set to 1 if the edge exists and 0 otherwise. Besides the direct interaction between i and j, the size of their shared neighborhood N is taken into account and normalized by the overall degrees of the two network objects.

Next, each network object receives its final score based on the interconnectivity to all differentially expressed genes:

where d represents a differentially expressed gene and DEG the set of all differentially expressed genes.
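The following Python sketch shows one way to compute such scores. The exact formulas are assumptions chosen to match the description: the pairwise term is taken as (2·e(i,j) + |N(i) ∩ N(j)|) / sqrt(deg(i)·deg(j)), and the final score as the mean of that term over all genes in DEG. The function names are hypothetical.

```python
import math


def icn(adj, i, j):
    """Pairwise interconnectivity of nodes i and j (assumed form)."""
    if not adj[i] or not adj[j]:
        return 0.0
    direct = 1 if j in adj[i] else 0   # e(i, j)
    shared = len(adj[i] & adj[j])      # size of the shared neighborhood
    return (2 * direct + shared) / math.sqrt(len(adj[i]) * len(adj[j]))


def interconnectivity_scores(adj, deg_genes):
    """Average interconnectivity of every node to the DEG set."""
    return {
        node: sum(icn(adj, node, d) for d in deg_genes if d != node) / len(deg_genes)
        for node in adj
    }


# adj maps each node to the set of its neighbors (undirected network).
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
icn_scores = interconnectivity_scores(adj, ["A", "B"])
```

Node `C`, which interacts directly with both differentially expressed genes, scores higher than `D`, which reaches them only through a shared neighbor.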

HIDDEN NODES

Hidden Nodes is a shortest-path-based method that identifies candidates based on their topological significance [20]. First, a directed shortest-path network is constructed that connects all differentially expressed genes. This network contains the differentially expressed genes as well as all network objects on shortest paths between them. Next, each network object i in the shortest-path network is scored based on its topological significance for connecting the differentially expressed genes. Under the null hypothesis, network object i has no special role in connecting a differentially expressed gene j to the rest of the differentially expressed genes. Using a hypergeometric test, the significance of network object i can be assessed by comparing the number of shortest paths between the differentially expressed genes that contain network object i to the total number of shortest paths.

CAUSAL REASONING

Causal Reasoning is a shortest-path-based method that aims at the identification of upstream regulators that cause gene expression changes observed in transcriptomics data [21]. Causal Reasoning relies on a directed network that is annotated with activation and inhibition edges. Causal Reasoning identifies candidates in the network that can be reached via a pre-defined maximum shortest-path length from the differentially expressed genes. Candidates are scored based on the number of differentially expressed genes that can be reached via the shortest paths (enrichment) and the correctness of the regulation. The correctness is assessed based on the activation and inhibition edges along the paths and the expected and real direction of fold changes of the differentially expressed genes.

Although Causal Reasoning scores all candidates that can be reached within a specified shortest-path length, post-filtering of the results is required. As proposed by the developers, a score threshold of 3, a correctness percentage of 60%, and enrichment and concordance p-values below 0.05 are reasonable thresholds.
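The correctness part of the scoring can be sketched as follows. This is a simplified illustration, not the published method: it walks downstream from one candidate (equivalent to searching upstream from the differentially expressed genes), predicts each gene's direction as the product of edge signs along the shortest path, and assumes the score is the number of correct minus incorrect predictions. The p-value calculations are omitted, and all names are hypothetical.

```python
from collections import deque


def predicted_effects(edges, source, max_steps):
    """Sign (+1/-1) of the shortest paths from `source` to each node.

    edges: (src, dst, sign) triples, sign +1 = activation, -1 = inhibition.
    A value of 0 marks nodes whose shortest paths disagree in sign.
    """
    out = {}
    for src, dst, sign in edges:
        out.setdefault(src, []).append((dst, sign))

    dist = {source: 0}
    effect = {source: 1}  # hypothesis: the candidate itself is activated
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_steps:
            continue
        for nxt, sign in out.get(node, []):
            new_sign = effect[node] * sign
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                effect[nxt] = new_sign
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1 and effect[nxt] != new_sign:
                effect[nxt] = 0  # conflicting shortest paths: ambiguous
    effect.pop(source)
    return effect


def causal_score(edges, candidate, observed, max_steps=2):
    """Count correct and incorrect predictions for observed changes.

    observed: dict gene -> +1 (up-regulated) or -1 (down-regulated).
    """
    effects = predicted_effects(edges, candidate, max_steps)
    correct = incorrect = 0
    for gene, direction in observed.items():
        prediction = effects.get(gene, 0)
        if prediction == direction:
            correct += 1
        elif prediction == -direction:
            incorrect += 1
    return {"correct": correct, "incorrect": incorrect, "score": correct - incorrect}


edges = [("R", "A", 1), ("R", "B", -1), ("A", "C", 1), ("B", "D", -1)]
observed = {"A": 1, "B": -1, "C": 1, "D": -1}
result = causal_score(edges, "R", observed)
```

In this toy network the regulator `R` explains three of the four observed changes; the double inhibition through `B` predicts `D` to go up, which contradicts the observation.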

SIGNET

The causal reasoning algorithm was systematically evaluated in [22], and several new metrics for scoring the upstream regulator nodes were suggested in this work.

• The “Lambda” score reflects only how many genes can be reached from node A and the average shortest-path distance from node A to the signature genes B ∈ S:

where Ind(A,B) = 1 if B is reachable from A in “maxSteps” steps and 0 otherwise.


• “Power weights” combines the absolute fold changes (AFCs) of targets and the length of the shortest path (SP) from the regulator: Pw(A)= ∑B∈S AFC(B)SP(A,B)

• “Exponential power weights” is similar, but the contributions of fold changes and shortest paths are different: Ew(A)= ∑B∈S e–SP(A,B)∙AFC(B)

SigNet is a consensus scoring approach. It computes causal reasoning rankings using the above-mentioned metrics and then takes the best rank across the three as the final “score” for an upstream regulator. Tested on ConnectivityMap [23] data, the algorithm correctly predicted a true regulator (compound) in the top 0.5% of ranked nodes in 60% of tested cases.

NODE PRIORITIZATION (GLOBAL TOPOLOGY)

Global methods take the complete network topology into account for prioritizing novel candidates.

NETWORK PROPAGATION

Network Propagation is a flow-based method that prioritizes candidates by smoothing disease-associated information over the network [24]. The scoring of the network objects can be regarded as propagating flow through the network. The starting points of the flow correspond to the differentially expressed genes and are assigned a flow of 1, while the remaining network objects are assigned a flow of 0. These flow assignments represent the prior knowledge of the condition and are smoothed over the network to prioritize candidates that are in close proximity to all differentially expressed genes. The scoring is done by simulating an iterative process where flow is pumped from the starting points to their network neighbors. In addition, every network object propagates the flow received in the previous iteration to its neighbors. The iterations are repeated until a steady state is reached. The final flow that each network object received corresponds to its final score and defines the rank of the object in the list of candidates. In each iteration, the flow for the network objects is updated as follows:

$F_t$ is a vector containing the flow for each network object at time point $t$. $A'$ corresponds to the adjacency matrix of the graph, where each entry is normalized by the degrees of the source and target nodes. The normalization by node degrees compensates for the fact that nodes with many interactors have a higher chance of picking up flow by chance and are thus more likely to be ranked higher in the prioritization. $F_0$ represents the prior knowledge vector containing the scores for differentially expressed genes. The algorithm terminates when the L1 norm of the difference between $F_t$ and $F_{t-1}$ drops below $10^{-6}$.
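A minimal NumPy sketch of the iteration, assuming an update rule of the form F_t = α·A'·F_{t−1} + (1 − α)·F_0 with each adjacency entry divided by the square root of the product of the two node degrees. The smoothing parameter α and its default value are assumptions made for the illustration.

```python
import numpy as np


def network_propagation(adj, prior, alpha=0.5, tol=1e-6, max_iter=1000):
    """Iteratively smooth the prior vector over the network.

    adj:   symmetric 0/1 adjacency matrix.
    prior: F_0, e.g. 1 for differentially expressed genes and 0 otherwise.
    """
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    deg[deg == 0] = 1.0                        # avoid division by zero
    norm = adj / np.sqrt(np.outer(deg, deg))   # degree-normalized A'
    f0 = np.asarray(prior, dtype=float)
    f = f0.copy()
    for _ in range(max_iter):
        f_next = alpha * norm @ f + (1 - alpha) * f0
        if np.abs(f_next - f).sum() < tol:     # L1 norm of the change
            return f_next
        f = f_next
    return f


# Path graph 0 - 1 - 2 - 3 with a single seed at node 0.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
flow = network_propagation(adj, prior=[1, 0, 0, 0])
```

On this toy path the final flow decreases with distance from the seed node, which is the ranking behavior described above.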

RANDOM WALKS

A random walk describes the transition of a random walker through a network [25]. In a random walk, a set of network objects is defined as starting points, corresponding to the differentially expressed genes here. In each iteration, the random paths are extended by transitioning to an adjacent network object with equal probability. Additionally, a random walk has a certain probability of terminating and restarting from the starting points. In each step, the network objects are assigned probabilities describing the chance of a random walk traversing this object. Upon convergence of the probabilities, the network objects are ranked by their visitation probabilities. Network objects with high probability scores are most proximal to all starting points and are considered candidates:

$$P_t = \alpha A' P_{t-1} + (1 - \alpha) P_0$$

$P_t$ is a vector containing the visitation probabilities for all network objects at time point $t$. $A'$ describes the normalized adjacency matrix of the network, which has been transformed into a stochastic matrix. $P_0$ represents the vector of starting points for the random walk, where each network object corresponding to a differentially expressed gene is assigned the same starting probability. Finally, $\alpha$ is a weighting factor: the walk continues to a neighbor with probability $\alpha$ and restarts from the starting points with probability $1 - \alpha$.
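The random walk with restart can be sketched as below, assuming α is the probability of continuing the walk and 1 − α the probability of restarting. The function name, default values and toy graph are illustrative assumptions, and every node is assumed to have at least one neighbor so that probability mass is conserved.

```python
def random_walk_with_restart(adj, seeds, alpha=0.7, tol=1e-6, max_iter=1000):
    """adj: node -> set of neighbours; seeds: starting points of the walk."""
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in adj}  # P_0
    prob = dict(start)
    for _ in range(max_iter):
        # Restart term: (1 - alpha) * P_0.
        updated = {n: (1 - alpha) * start[n] for n in adj}
        # Walk term: each node passes alpha * its probability equally to its neighbours.
        for m in adj:
            for n in adj[m]:
                updated[n] += alpha * prob[m] / len(adj[m])
        change = sum(abs(updated[n] - prob[n]) for n in adj)
        prob = updated
        if change < tol:
            break
    return prob


graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
visit = random_walk_with_restart(graph, seeds={"a"})
candidates = sorted(visit, key=visit.get, reverse=True)
```

Unlike the degree-normalized flow in network propagation, the transition matrix here is column-stochastic, so the scores remain a probability distribution at every step.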

TOPPNET

ToppNet is a suite of several algorithms for node prioritization (PageRank, HITS, K-step Markov) [26]. The first two methods are quite similar to network propagation (with HITS implementing a directed version of the flow propagation); the third one is substantially different. The K-Step Markov approach computes the relative probability that the system will spend time at any particular node given that it starts in a set of start nodes R and ends after K steps. The value of K controls the relative “bias” toward R. The equation to compute the K-Step Markov importance is $I(t|R) = [A p_R + A^2 p_R + \dots + A^K p_R]_t$, where $A$ is the transition probability matrix of size $n \times n$, $p_R$ is an $n \times 1$ vector of initial probabilities for the root set $R$, and $I(t|R)$ is the $t$-th entry in this sum vector.
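The K-Step Markov sum can be illustrated with a short sketch. It assumes a uniform initial probability over the root set and uniform transitions to neighbors; the function name and the toy graph are hypothetical.

```python
def k_step_markov(adj, roots, k=3):
    """Return I(t|R): the summed probability of being at node t after 1..K steps."""
    prob = {n: (1.0 / len(roots) if n in roots else 0.0) for n in adj}  # p_R
    importance = {n: 0.0 for n in adj}
    for _ in range(k):
        nxt = {n: 0.0 for n in adj}
        for m in adj:
            for n in adj[m]:
                nxt[n] += prob[m] / len(adj[m])  # one multiplication by A
        prob = nxt
        for n in adj:
            importance[n] += prob[n]  # accumulate A^i p_R
    return importance


path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
importance = k_step_markov(path, roots={"a"}, k=2)
```

With K = 2 on the path a - b - c and root a, the walker is at b after one step and at a or c (with equal probability) after two, giving importances of 0.5, 1.0 and 0.5.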

SUBNETWORK PRIORITIZATION

Module-based methods identify subnetworks that are enriched for disease-associated genes. Candidates correspond to those network objects within the subnetworks that have not yet been associated with the disease.

ACTIVE MODULES

The Active Modules approach aims at the identification of subnetworks that exhibit joint gene expression changes between different conditions [27]. The biological activity of a subnetwork is rated by first assessing the significance of differential expression for each gene. The significance per gene can be expressed as a z-score, and the overall activity of a subnetwork corresponds to the sum of z-scores for the contained genes. The computationally challenging part is the identification of the highest-scoring subnetwork, i.e., the biologically most active subnetwork. To reduce computational time, Active Modules makes use of the simulated annealing heuristic. In practice, this approach is not guaranteed to find the optimal subnetwork; however, all high-scoring subnetworks are biologically interesting regardless of whether they are strictly maximal. The simulated annealing heuristic works by randomly selecting a set of starting points in the network, corresponding to the initial subnetwork. In each iteration that follows, a new network object is added to the subnetwork. The new object is kept if the score of the subnetwork improves, and is kept or discarded with predefined probabilities otherwise.

SUBNETWORK MARKERS

The approach aims to identify disease-associated subnetworks that allow for patient stratification [28,29]. Subnetwork-based markers are more reproducible between patients than single genes, and network-based patient classification achieves higher prediction accuracy than gene-based classification. The activity of a subnetwork in two conditions is inferred by 1) transforming the gene expression values to z-scores and 2) separately summing up the subnetwork z-scores for the conditions. High-scoring subnetworks are identified using a greedy search: each subnetwork starts with a single high-scoring node (seed) and is iteratively expanded. In each iteration, the gene with the highest score is added, provided that it is within a predefined range of the seed and that it increases the subnetwork score over a predefined improvement rate.
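The greedy expansion can be sketched as follows. The scoring function used here (sum of z-scores divided by the square root of the subnetwork size) is a stand-in chosen for illustration rather than the score used in the cited work; `max_dist`, `min_gain`, the function names and the toy data are likewise assumptions, and the seed is assumed to have a positive score.

```python
from collections import deque
from math import sqrt


def within_range(adj, seed, max_dist):
    """Breadth-first search: nodes at most max_dist edges away from the seed."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        if dist[u] == max_dist:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


def score(members, z):
    """Illustrative aggregate score: sum of z-scores scaled by sqrt(size)."""
    return sum(z[g] for g in members) / sqrt(len(members))


def greedy_subnetwork(adj, z, seed, max_dist=2, min_gain=0.05):
    allowed = within_range(adj, seed, max_dist)
    module, current = {seed}, score({seed}, z)
    while True:
        frontier = ({v for u in module for v in adj[u]} & allowed) - module
        if not frontier:
            break
        best = max(sorted(frontier), key=lambda g: score(module | {g}, z))
        new = score(module | {best}, z)
        if new < current * (1 + min_gain):  # required relative improvement not met
            break
        module.add(best)
        current = new
    return module, current


adj = {"s": {"a", "x"}, "a": {"s", "b"}, "b": {"a", "c"}, "c": {"b"}, "x": {"s"}}
z = {"s": 3.0, "a": 2.5, "b": 2.0, "c": 4.0, "x": -1.0}
module, module_score = greedy_subnetwork(adj, z, seed="s")
```

In the toy network, gene "c" has the highest z-score but lies three edges from the seed, so the range constraint keeps it out of the module.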

DYSREGULATED SUBNETWORKS (DEGAS)

DEGAS aims to detect subnetworks in which multiple genes are dysregulated in a given condition, while allowing for distinct affected gene sets in each condition sample [30]. The input to DEGAS is a number of gene expression samples for a specific condition and a number of gene expression samples for the corresponding control. First, the expression data are converted into a binary “genes over condition samples” matrix in which “1” appears in position (i,j) if gene i is dysregulated in condition sample j (relative to the expression levels of i in the control cohort). The goal of the method is then to identify the smallest subnetwork in which at least k genes are dysregulated in all but l cases. DEGAS thus has two main parameters: k, the number of genes affected in the subnetwork in each condition sample, and l, the number of allowed outliers, i.e., condition samples excluded from the analysis. Like Active Modules, DEGAS uses a heuristic to identify dysregulated subnetworks. The method starts from all network objects simultaneously and uses a greedy search heuristic to iteratively expand the best current solution.
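The binary matrix and the (k, l) criterion can be illustrated with a small sketch. The dysregulation call used here (more than `cutoff` control standard deviations from the control mean) is an assumption for illustration; the function names and expression values are hypothetical, and the search for the smallest subnetwork is not shown.

```python
from statistics import mean, stdev


def dysregulation_matrix(case, control, cutoff=2.0):
    """Build the binary genes-over-samples matrix from case and control expression."""
    matrix = {}
    for gene, values in case.items():
        mu, sd = mean(control[gene]), stdev(control[gene])
        matrix[gene] = [1 if abs(x - mu) > cutoff * sd else 0 for x in values]
    return matrix


def satisfies_k_l(matrix, genes, k, l):
    """True if at least k of the genes are dysregulated in all but at most l samples."""
    n_samples = len(next(iter(matrix.values())))
    failing = sum(
        1 for j in range(n_samples) if sum(matrix[g][j] for g in genes) < k
    )
    return failing <= l


control = {"g1": [1.0, 1.1, 0.9], "g2": [2.0, 2.1, 1.9], "g3": [3.0, 3.1, 2.9]}
case = {"g1": [1.0, 2.0, 2.0], "g2": [3.0, 2.0, 3.0], "g3": [3.0, 4.0, 3.0]}
matrix = dysregulation_matrix(case, control)
```

Here each condition sample has a different set of affected genes, which is the situation the method is designed to tolerate.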

PATHWAY INFERENCE FROM GENE LISTS

The method takes as input a set of query genes that arise from an omics experiment and an interaction network. The goal is to extract a high-scoring subnetwork of genes and interactions between them from the network [31]. The subnetworks found by this algorithm predominantly consist of genes contained in the input set, but they can contain some additional genes. The genes contained in the highest-scoring subnetwork, along with the interactions between them, provide insight into the mechanism of action underlying the omics experiment that gave rise to the query set of genes. The subnetwork scoring builds on the Active Modules approach and requires a significance p-value for each gene. However, the scoring function is modified compared to the original algorithm such that the resulting subnetworks have relatively small sizes. To identify the high-scoring subnetworks, a greedy search is used. Here, the highest-scoring nodes are grouped into connected components and iteratively expanded. In the expansion step, a limited number of nodes with low scores may be included in the subnetwork and are kept if this leads to the merging of two connected components.

DENSE

DENSE is a method that, similar to MCODE, detects densely connected subnetworks [32]. The main difference is that the approach takes prior knowledge of the biology as input, i.e., information on differentially expressed genes, mutated genes or other types of OMICs data. The prior information is first mapped to the corresponding network objects, which serve as the starting points for the approach. Based on the information, DENSE identifies all topologically dense functional subnetworks in the molecular interaction network that contain a number of the starting points. The subnetworks are then prioritized based on their density (i.e., the number of interactions within the subnetwork) and the enrichment (i.e., the number of starting points contained in the subnetwork).

CASNET

CASNet is a method for inferring active subnetworks from a directed interaction network using a combined node and edge scoring approach [33]. Previous methods for identifying active subnetworks mostly use undirected PPI networks and node-centric approaches, which can limit their ability to find meaningful subnetworks. CASNet requires two inputs: 1) a network with directed and annotated edges (i.e., activation and inhibition) and 2) a gene expression data set with differentially expressed genes, their fold changes, and significant p-values. The idea is to find the highest-scoring subnetwork whose edge annotations are consistent with the fold changes in the expression data. For example, if n1 promotes n2, i.e., n1 → n2, and n1 and n2 are either both up-regulated or both down-regulated, then the edge e = (n1 → n2) is considered to be consistent with the data.

The subnetwork scoring function incorporates node and edge scores. The node scoring function incorporates the DEG p-values and also penalizes nodes with high degree. The edge scoring function assigns high scores to edges whose annotations are consistent with the experimental data. The final subnetwork score is the sum of the two scoring functions.
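The edge consistency check can be sketched as below. The activation rule follows the example given above; the rule for inhibition edges (source and target changing in opposite directions) is an assumption made by analogy, and the gene names and fold changes are hypothetical.

```python
def edge_consistent(fc_source, fc_target, effect):
    """Check whether an annotated edge agrees with the signed fold changes."""
    product = fc_source * fc_target
    if effect == "activation":
        return product > 0  # both up-regulated or both down-regulated
    if effect == "inhibition":
        return product < 0  # assumed rule: opposite directions
    raise ValueError(f"unknown edge annotation: {effect}")


def consistent_edges(edges, log_fc):
    """Keep edges whose endpoints are both measured and whose annotation fits the data."""
    return [
        (u, v, effect)
        for u, v, effect in edges
        if u in log_fc and v in log_fc and edge_consistent(log_fc[u], log_fc[v], effect)
    ]


log_fc = {"n1": 1.8, "n2": 2.1, "n3": -1.5}
edges = [
    ("n1", "n2", "activation"),
    ("n1", "n3", "activation"),
    ("n2", "n3", "inhibition"),
    ("n3", "n4", "activation"),
]
kept = consistent_edges(edges, log_fc)
```

In a full scoring function these consistent edges would receive high edge scores, which are then added to the node scores of the subnetwork.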

MCODE

MCODE is a graph theoretic clustering algorithm that detects densely connected regions in molecular interaction networks that may represent functionally relevant subnetworks [34]. The algorithm operates in three stages, namely vertex weighting, subnetwork prediction and optionally post-processing to filter the subnetworks by certain criteria. In the first stage, vertex weighting, all network objects are assigned a weight based on the density of their network neighborhood. In the second stage, the network objects with the highest weights are regarded as seeds and the algorithm iteratively expands the subnetworks around the seeds. The extension is a greedy procedure in which the neighboring network objects with the highest weights are added to the current subnetworks.

MCODE does not require any OMICs data per se, but identifies subnetworks based on the topological features of a network. In a post-processing step, the identified subnetworks can be filtered based on the OMICs data of interest, e.g., a certain percentage of dysregulated genes in a subnetwork may be required.

HOTNET

HotNet is a method for the de novo identification of subnetworks in a molecular interaction network that contain more mutated genes than expected by chance in a statistically significant number of patients [35,36]. This problem differs from the identification of dysregulated subnetworks based on gene expression because a relatively small number of genes might be measured, a small subset of genes in a subnetwork may be mutated, and a single mutated gene may be sufficient to perturb a subnetwork.

The HotNet approach consists of two main phases. First, each pair of network objects is assigned an “influence” value. The influence measure quantifies the influence a network object has on another. The influence is based on the proximity of the two network objects in the molecular network and also on the number of paths that connect them. Based on the influence measure, an influence network is built and only those network objects tested for mutations are kept. In the second phase, there are two options for identifying significantly mutated subnetworks. One option is to identify subnetworks that contain a large number of mutated genes. The other option is to enhance the influence network by weighting the network object pairs by the number of mutations observed on these genes.

PATHWAY PRIORITIZATION

The subnetwork approaches presented in the previous section identify biologically relevant modules in a large molecular interaction network. However, their interpretation can be difficult, especially if a subnetwork spans multiple functional pathways and biological processes. Pathway-based methods, in contrast, operate on a set of predefined pathways. Although limited by the accuracy of currently known pathways, they have the advantage of identifying dysregulation directly at the pathway level; novel candidates correspond to pathway genes that have not yet been associated with a certain disease.

SPIA

SPIA (signaling pathway impact analysis) aims at the identification of perturbed pathways in a given condition by combining enrichment of perturbed genes in the pathway with the actual amount of perturbation, leading to the most promising candidate pathways and thus candidate genes [37]. SPIA captures two different probabilities for each pathway: 1) the enrichment of differentially expressed genes within the pathway and 2) the level of perturbation within the pathway as measured by propagating expression changes through the pathway. The enrichment can be calculated by applying a simple hypergeometric test. To estimate the level of perturbation within a pathway, a perturbation factor is calculated for each gene as follows:

$$PF(g_i) = FC(g_i) + \sum_{j \in US(g_i)} \beta_{ij} \frac{PF(g_j)}{N_{ds}(g_j)}$$

$FC(g_i)$ represents the signed expression change of gene $i$. The second part of the equation is the sum of perturbation factors of all genes $j$ that are directly upstream of gene $i$ (the set $US(g_i)$), normalized by the number of downstream genes $N_{ds}(g_j)$ of each such gene $j$. $\beta_{ij}$ reflects the type of interaction between genes $i$ and $j$: in case of an activation edge, it is set to +1 and in case of inhibition to -1.

The net perturbation accumulation at the level of gene i is calculated as the gene’s perturbation factor minus its observed fold change. The overall pathway perturbation is computed as the sum of all perturbation accumulations. Finally, the probability of observing this pathway perturbation is calculated.

The pathways are then ranked by the combination of the two probabilities:

$$P_G = c_i - c_i \ln(c_i)$$

where $P_G$ represents the overall pathway perturbation probability and $c_i$ the product of the two different probabilities described before.
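A minimal sketch of the perturbation factor and of the combined probability, assuming an acyclic pathway so that the factors can be computed recursively (pathways with cycles require solving a linear system instead). The combination rule used is P_G = c − c·ln(c); the function names, gene names, fold changes and p-values are hypothetical.

```python
from math import log


def perturbation_factors(fold_change, edges):
    """edges: (upstream, downstream, beta) with beta = +1 (activation) or -1 (inhibition)."""
    n_downstream, upstream = {}, {}
    for source, target, beta in edges:
        n_downstream[source] = n_downstream.get(source, 0) + 1
        upstream.setdefault(target, []).append((source, beta))
    pf = {}

    def compute(gene):
        # PF = own fold change + signed, normalized PF of direct upstream genes.
        if gene not in pf:
            pf[gene] = fold_change.get(gene, 0.0) + sum(
                beta * compute(src) / n_downstream[src]
                for src, beta in upstream.get(gene, [])
            )
        return pf[gene]

    for gene in set(fold_change) | set(n_downstream) | set(upstream):
        compute(gene)
    return pf


def combined_probability(p_enrichment, p_perturbation):
    """Combine the two pathway probabilities into a single value."""
    c = p_enrichment * p_perturbation
    return c - c * log(c)


fc = {"A": 2.0, "B": 1.0, "C": -1.0}
edges = [("A", "B", 1), ("A", "C", -1)]
pf = perturbation_factors(fc, edges)
accumulation = {g: pf[g] - fc[g] for g in fc}  # net perturbation accumulation
p_global = combined_probability(0.05, 0.1)
```

Gene A has two downstream genes, so half of its perturbation factor is passed to each, with the sign flipped along the inhibition edge to C.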

PATHWAY ACTIVITY INFERENCE

The method identifies pathway-based markers that can be used for classification by inferring the activity of a given pathway based on the expression levels of the constituent genes [38]. The pathway activity is calculated based on the log likelihood ratio (LLR) between two phenotypes of interest for each pathway gene. The LLR is calculated using the conditional probability density function (PDF) of the expression level of each pathway gene under phenotype 1 compared to the conditional PDF under phenotype 2. Since the LLR is computed based on the difference in distribution of the gene expression values under different conditions, the direction and the amount of expression changes do not have large effects on the overall discriminative power of the pathway marker.
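The log likelihood ratio can be illustrated as follows. The sketch assumes that the conditional PDFs are Gaussians estimated from training samples of each phenotype and that the pathway activity is the plain sum of per-gene LLRs; both choices, the function names and the toy data are assumptions made for illustration.

```python
from math import log, pi
from statistics import mean, stdev


def gaussian_logpdf(x, mu, sd):
    """Log density of a normal distribution with mean mu and standard deviation sd."""
    return -0.5 * log(2 * pi * sd * sd) - (x - mu) ** 2 / (2 * sd * sd)


def pathway_activity(sample, phenotype1, phenotype2):
    """Sum over pathway genes of log f(x | phenotype 1) - log f(x | phenotype 2)."""
    activity = 0.0
    for gene, x in sample.items():
        ll1 = gaussian_logpdf(x, mean(phenotype1[gene]), stdev(phenotype1[gene]))
        ll2 = gaussian_logpdf(x, mean(phenotype2[gene]), stdev(phenotype2[gene]))
        activity += ll1 - ll2
    return activity


# Training expression values per pathway gene for the two phenotypes.
pheno1 = {"g1": [1.0, 2.0, 3.0], "g2": [5.0, 6.0, 7.0]}
pheno2 = {"g1": [7.0, 8.0, 9.0], "g2": [5.0, 6.0, 7.0]}
like_pheno1 = pathway_activity({"g1": 2.0, "g2": 6.0}, pheno1, pheno2)
like_pheno2 = pathway_activity({"g1": 8.0, "g2": 6.0}, pheno1, pheno2)
```

A positive activity indicates that the sample's pathway genes look more like phenotype 1, a negative value more like phenotype 2; gene g2 contributes nothing because its distribution is the same in both phenotypes.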

UNSUPERVISED NETWORK ANALYSIS

These algorithms use pathway and network information in an unsupervised analysis setting, helping to identify clusters of patients that may correspond to disease subtypes.

Most of these algorithms use “network-aware” biclustering, identifying clusters of topologically close genes that have distinct expression patterns in a subset of patients. Another type of method works with pathways, calculating a patient-wise matrix of pathway scores for subsequent exploration with conventional unsupervised analysis approaches.

NBS

This algorithm was developed to stratify cancer patients based on their somatic mutation profiles [39]. It was able to recover histological subtypes of ovarian cancer based on somatic mutation profiles. The unsupervised analysis part is just one step in the original algorithm (network-constrained NMF). Initially used on mutation data transformed by network propagation, it can be adapted for different use cases.

NMF stands for non-negative matrix factorization, a technique for decomposing a matrix F of signals into the product of two matrices of smaller dimensions (W and H). Network-regularized NMF is an extension that constrains NMF to respect the structure of an underlying gene interaction network. This is accomplished by minimizing the following objective function using an iterative approach:

$$\min_{W,H>0} \|F - WH\|^2 + \mathrm{trace}(W^t K W)$$

W and H form a decomposition of the patient × gene matrix F (resulting from network smoothing as described above) such that W is a collection of basis vectors, or “metagenes” (defining subtypes and contributions of genes to each subtype), and H is the basis vector loadings (defining patient classification into subtypes). The $\mathrm{trace}(W^t K W)$ term constrains the basis vectors in W to respect local network neighborhoods. The term K is an adjacency matrix of a “nearest neighbors” network derived from the original network.

NCIS

NCIS [40] combines gene network information to simultaneously group samples and genes into biologically meaningful clusters. Prior to clustering, genes are weighted based on their impact in the network (a directed network propagation algorithm is applied to the median absolute deviation values of genes). Then a weighted co-clustering algorithm is applied (authors used a new weighted co-clustering method, Semi-Nonnegative Matrix Tri-Factorization, to simultaneously separate samples into subtypes and group genes into functionally relevant subclasses).

INTEGRATIVE NETWORK ANALYSIS

These algorithms use network information for the joint analysis of several different high-throughput data sets. They are mostly intended to recover mechanisms by which events at different molecular levels (such as DNA mutations and gene expression changes) might interact with each other in a disease or other phenotype [41].

TIEDIE

TieDie [42] is one of the more general approaches. It tries to infer the most likely paths in the network connecting the set of “cause” regulatory nodes S to the set of “consequence” effector nodes T. For example, in the cancer setting, S may correspond to genes involved in genomic alterations (mutations, deletions, and amplifications), whereas T may correspond to genes involved in transcriptional responses.

The TieDie algorithm extends the network propagation “flow” strategy by using two diffusion processes in a directed network and then identifying overlapping regions with high scores, which provide connectivity between S and T. A score reflecting a node’s importance in this setting can be computed as $z = f(r(x, A), r(y, A^T))$, where $r(x, A)$ is the topological significance score computed from a set of start nodes $x$ and adjacency matrix $A$. The transpose of the adjacency matrix is used to force the diffusion to proceed upward from the targets by supplying a graph containing reversed edges.

The function $f()$ is chosen to assign high relevance scores to nodes where both $r(x, A)$ and $r(y, A^T)$ are high and lower scores when either of the two is low (in the simplest case, $f$ can just be the minimum of the two scores).

A set of linking genes is obtained by thresholding the linking scores using a chosen value α selected to guarantee a desired level of specificity. A subnetwork is then generated, containing direct interactions connecting subsets of S, T, and linking genes.
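The two-sided diffusion with a minimum as the combining function can be sketched as below. The diffusion used here is a simple restart-based propagation along directed edges, chosen as a stand-in for r(x, A); it is not claimed to be the diffusion kernel of the original method. The parameter `alpha` inside `diffuse`, the threshold value and the toy network are assumptions.

```python
def diffuse(edges, nodes, sources, alpha=0.5, n_iter=50):
    """Spread score from the sources along directed edges, restarting at the sources."""
    out = {n: [] for n in nodes}
    for u, v in edges:
        out[u].append(v)
    start = {n: (1.0 if n in sources else 0.0) for n in nodes}
    score = dict(start)
    for _ in range(n_iter):
        nxt = {n: (1 - alpha) * start[n] for n in nodes}
        for u in nodes:
            for v in out[u]:
                nxt[v] += alpha * score[u] / len(out[u])
        score = nxt
    return score


def tiedie_links(edges, nodes, S, T, threshold):
    downstream = diffuse(edges, nodes, S)                        # r(x, A)
    upstream = diffuse([(v, u) for u, v in edges], nodes, T)     # r(y, A^T): reversed edges
    z = {n: min(downstream[n], upstream[n]) for n in nodes}      # f = minimum
    links = {n for n in nodes if z[n] >= threshold and n not in S and n not in T}
    return links, z


nodes = ["s", "m", "t", "x", "y"]
edges = [("s", "m"), ("m", "t"), ("s", "x"), ("y", "t")]
links, z = tiedie_links(edges, nodes, S={"s"}, T={"t"}, threshold=0.1)
```

Node "x" is reachable from the source but does not lead to the target, and "y" leads to the target but is not reachable from the source; only "m" scores well in both diffusions and is retained as a linking gene.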

EQED

This approach [43] is one of the eQTL-inspired methods. eQTL is a GWAS association of a variant with gene expression, and in the case where the variant is not located near the regulated genes, one needs to know 1) what is the causal variant (the associated variant may itself be just a positional marker and be in LD with the true causal variant nearby in the genome) and 2) how the regulation occurs. eQED is a flow-based method that tries to prioritize the true causal gene from a set of candidates and infer a pathway connecting them to the downstream regulated target gene.

The eQTL associations and the corresponding protein network are abstracted as an analog electric circuit model grounded at a given target gene. The weights on the edges of the molecular network are modeled as conductances (1/resistance) in the electric circuit. The P-values of association between each genetic locus and expression of the target are modeled as independent sources of current. An electric circuit abstraction is constructed for every locus–target association. After solving the circuit for currents, the causal gene is predicted as the one with the highest current running through it. Analyzing the network as an electric circuit provides a deterministic “steady-state” solution, in contrast to a stochastic random walk.

HUANG ET AL.

This approach is based on the prize-collecting Steiner tree principle [44]. The Steiner tree problem works with a weighted graph and a set of start nodes (terminals) and tries to find a minimum-weight tree connecting as many start nodes as possible. The algorithm balances two costs: (i) it pays a penalty for leaving a terminal out of the network, and (ii) it pays a price for using edges to include a terminal in the network. The size of the solution network can be controlled by a single parameter β that weights the penalties of excluding terminal nodes relative to the cost of including edges. More reliable edges have lower cost than less reliable ones, and penalties for excluding nodes can be based on node score (e.g., fold change). The solution is a subtree connecting start nodes with high-relevance edges, while possibly leaving some of the nodes out.

The algorithm can also work with several sets of start nodes, supplying different penalties for their exclusion.
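The trade-off between the two costs can be made concrete by evaluating the objective for candidate trees. The sketch below only scores a given tree (β times the prizes of excluded terminals plus the costs of the edges used); it does not search for the optimal tree, and the function name, prizes and edge costs are hypothetical.

```python
def pcst_objective(tree_edges, edge_cost, prize, beta=1.0):
    """Objective = beta * (prizes of terminals left out) + (cost of edges used)."""
    included = {node for edge in tree_edges for node in edge}
    penalty = beta * sum(p for node, p in prize.items() if node not in included)
    cost = sum(edge_cost[edge] for edge in tree_edges)
    return penalty + cost


# Node prizes (e.g., derived from fold changes) and edge costs (lower = more reliable).
prize = {"a": 5.0, "b": 4.0, "c": 1.0}
edge_cost = {("a", "b"): 2.0, ("b", "c"): 3.0}

small_tree = pcst_objective([("a", "b")], edge_cost, prize)
full_tree = pcst_objective([("a", "b"), ("b", "c")], edge_cost, prize)
full_tree_high_beta = pcst_objective([("a", "b"), ("b", "c")], edge_cost, prize, beta=4.0)
small_tree_high_beta = pcst_objective([("a", "b")], edge_cost, prize, beta=4.0)
```

With β = 1 it is cheaper to leave the low-prize node "c" out than to pay for the costly edge that reaches it; raising β makes exclusion more expensive and favors the larger tree.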

PARADIGM (PARADIGM-SHIFT)

PARADIGM [45] takes a pathway and unfolds it to a Bayesian network with hidden and observed variables. Bayesian network inference approaches can then be used to calculate the hidden activities of nodes given observed integrated data for a patient.

The Bayesian network encodes the pathway using a number of random variables for each gene of a pathway and a set of functions (factors) that constrain the variables to take on biologically meaningful values as functions of one another. Each variable can take on one of three states corresponding to activated, nominal, or deactivated, relative to a control level (e.g., as measured in normal tissue). For each protein-coding gene G in the pathway, hidden variables are introduced to represent the copy number of the genome (G_DNA), mRNA expression (G_mRNA), protein level (G_protein) and protein activity (G_active). Factors reflecting the dependencies among the same gene’s variables are added (e.g., from G_DNA to G_mRNA). The interactions are encoded as dependencies of one gene’s variable on another gene’s variable (e.g., the transcriptional regulation X → Y is a dependence between X_active and Y_mRNA). Finally, observation variables and factors are added to complete the pathway and enable inference of the hidden variables’ states.

The Bayesian inference then can be used to compute a log-likelihood ratio L, which signifies our belief that an activity of entity i is up or down given a patient’s data. PARADIGM can produce a matrix of such inferred activities for every entity and every patient, allowing downstream analysis on patient level.

PARADIGM-SHIFT [46] estimates the possible effect of a mutation in gene G on the pathway. Loss of function is predicted when G’s downstream targets have activity consistent with a low activity of G relative to what is expected given its upstream regulators. Gain of function is predicted when the downstream targets are consistent with a high activity of G but the upstream regulators are not.

NETWORK COMPARISON

DE-MAP

This is a simple algorithm for finding the topology changes between two networks (assessed as differences in corresponding edge weights). Edges with large enough weight differentials are selected. In the original publication [47], an independent reference null distribution of score differences was used to calculate the p-values; in our case, a permutation scheme can be used instead to estimate the significance of rewiring. The algorithm can be useful to emphasize the differences between the mechanisms underlying different phenotypes.

EDIT DISTANCE

This is not strictly an algorithm, but a distance measure between two networks [48]. It is based on the computation of the minimum number of modifications required to transform an input graph into a reference one. Specifically, the distance measure is defined as the cost of recognition of nodes plus the number of transformations that include node insertion, node deletion, branch insertion, branch deletion, node label substitution, and branch label substitution. The measure can be useful, for instance, in comparing results of subnetwork prioritization algorithms run on different data sets.
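A simplified version of the measure can be sketched for the case where nodes carry unique labels shared by both networks, so no node matching is needed. The sketch assumes unit costs, counts only node and branch (edge) insertions and deletions, and ignores label substitutions and the node recognition cost; the function name and the example networks are hypothetical.

```python
def simple_edit_distance(nodes1, edges1, nodes2, edges2):
    """Number of node and edge insertions/deletions turning graph 1 into graph 2."""
    node_ops = len(set(nodes1) ^ set(nodes2))  # nodes present in only one graph
    # Undirected edges are stored as frozensets so that (u, v) equals (v, u).
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    edge_ops = len(e1 ^ e2)
    return node_ops + edge_ops


# Two hypothetical subnetworks returned for different data sets.
net_a = (["A", "B", "C"], [("A", "B"), ("B", "C")])
net_b = (["A", "B", "D"], [("B", "A"), ("B", "D")])
distance = simple_edit_distance(*net_a, *net_b)
```

Here node C and edge B-C are deleted while node D and edge B-D are inserted, giving a distance of 4; identical networks have a distance of 0.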

INTEGRATION OF MULTIPLE METHODS

The integration of multiple methods for prioritizing novel candidates can improve the overall performance of the predictions [49]. While global methods tend to outperform local and module-based methods, each class of approaches captures different aspects of the network and results in unique candidate predictions. To capture the most information possible, the results of multiple methods can be integrated using a logistic regression model. Such a model requires two inputs: 1) a set of features for each network object and 2) a gold standard data set to train the regression model. The features correspond to the prioritized candidate lists from all or a subset of the previously described methods. The gold standard data set contains a set of true candidates, i.e., a set of known drug targets for the condition of interest, which can be obtained from our Integrity knowledge base. The output of the model is a prioritized list of candidates based on the evidence from the different network-based methods.
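The integration step can be sketched with a small logistic regression fitted by gradient descent. The feature values stand for per-method scores scaled to [0, 1] and the labels for membership in a gold standard set of known targets; all names, values and training settings below are hypothetical and chosen only to illustrate the idea.

```python
from math import exp


def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))


def predict(weights, features):
    """Probability of being a true candidate; weights[0] is the intercept."""
    return sigmoid(weights[0] + sum(w * x for w, x in zip(weights[1:], features)))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Plain batch gradient descent on the logistic loss."""
    weights = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for xi, yi in zip(X, y):
            err = predict(weights, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad)]
    return weights


# Rows: network objects; columns: scores from two prioritization methods.
X_train = [[0.9, 0.8], [0.8, 0.9], [0.2, 0.1], [0.1, 0.3]]
y_train = [1, 1, 0, 0]  # 1 = known target in the gold standard
weights = fit_logistic(X_train, y_train)

candidates = {"geneX": [0.85, 0.85], "geneY": [0.15, 0.15], "geneZ": [0.6, 0.5]}
prioritized = sorted(candidates, key=lambda g: predict(weights, candidates[g]), reverse=True)
```

The fitted weights indicate how much each method contributes to the combined evidence, and the predicted probabilities give the final prioritized list.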

DELIVERABLES

Deliverables include perpetual access to the library of R scripts developed and improved under the CBDD partnership umbrella, as well as accompanying example data sets and networks, algorithm documentation, and trainings.

DELIVERABLE | DESCRIPTION
R Package | Algorithm implementations; example files; I/O and visualization functions
Documentation | Extensive user manual; algorithm testing (detailed walkthrough for each algorithm with toy data); performance evaluation of algorithm runtime with different inputs and options
Trainings | Workshops for all users after each new algorithm is released

QUALITY METRICS AND ACCEPTANCE

Upon delivery, each script will be subject to the following quality criteria:

• Each script has a specification with described functionality, requirements, and limitations.

• The script works as described in the specification.

• The implementation of the methods is tested using both toy data and real biological data sets to verify that the methods work as intended.

The following acceptance criteria are used for each update of the package:

• A bug is defined as an implementation that breaks functionality described in the specification.

• If a written bug report is not received from CBDD members within 20 business days after the script was delivered, the update is deemed fully accepted by CBDD members.

• Any changes to deliverables after the acceptance of the script will be handled through a separate work order to the agreement, or addressed in a subsequent phase of the development mutually agreed by Thomson Reuters and CBDD members. Changes are defined as script modifications that go beyond the specification. All other changes in functionality need a formal request to Thomson Reuters and approval from Thomson Reuters management, and may result in additional fees.

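TIMELINE

The following table outlines the estimated timeline for implementation of each algorithm.

AREA | METHOD | PHASE | ESTIMATED TIME | IMPLEMENTED*
Network generation | Adding Context-Specificity to the Scaffold Network; Data-Driven Network (coexpression); Combining Scaffold and Data-Driven Networks | Phase 1 | 15 days | Yes
Network generation | Network Generation: Scaffold Network | Phase 1 | 15 days | Yes
Network generation | Küffner et al. network reconstruction | Phase 4 | 10 days | Yes
Network generation | ARACNE | Phase 5 | 10 days | Yes
Node prioritization | Causal Reasoning | Phase 1 | 3 days | Yes
Node prioritization | Network Propagation | Phase 1 | 3 days | Yes
Node prioritization | Random Walk | Phase 1 | 10 days | Yes
Node prioritization | Neighborhood Scoring | Phase 1 | 2 days | Yes
Node prioritization | Guilt-by-association | Phase 1 | 2 days | Yes
Node prioritization | Interconnectivity | Phase 1 | 2 days | Yes
Node prioritization | Hidden Nodes | Phase 2 | 12 days | Yes
Node prioritization | SigNet | Phase 4 | 10 days | Yes
Node prioritization | ToppNet | Phase 5 | 10 days | Yes
Pathway prioritization | Pathway activity inference | Phase 1 | 15 days | Yes
Pathway prioritization | SPIA | Phase 2 | 15 days | Yes
Subnetwork prioritization | Active Modules | Phase 1 | 15 days | Yes
Subnetwork prioritization | Subnetwork markers | Phase 1 | 12 days | Yes
Subnetwork prioritization | Pathway Inference From Gene Lists | Phase 2 | 12 days | Yes
Subnetwork prioritization | DENSE | Phase 2 | 12 days | Yes
Subnetwork prioritization | DEGAS | Phase 3 | 18 days | Yes
Subnetwork prioritization | CASNet | Phase 3 | 15 days | Yes
Subnetwork prioritization | MCODE | Phase 3 | 10 days | Yes
Subnetwork prioritization | HotNet | Phase 3 | 15 days | Yes
Data integration approaches | TieDie | Phase 4 | 15 days | Yes
Data integration approaches | eQED | Phase 5 | 20 days | Yes
Data integration approaches | Huang et al. (PCST) | Phase 5 | 15 days | Yes
Data integration approaches | PARADIGM (+PARADIGM-SHIFT) | Phase 5 | 35 days |
Network comparison | dE-MAP | Phase 4 | 10 days | Yes
Network comparison | Edit distance | Phase 4 | 10 days | Yes
Unsupervised network analysis | NBS | Phase 4 | 15 days | Yes
Unsupervised network analysis | NCIS | Phase 4 | 10 days | Yes

*As of September 2015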

The implementation time for a method is estimated under the assumption that the method is properly described in the corresponding publication.

In addition to the implementation and testing time for each method, the code will need to be thoroughly documented and integrated into a user-friendly package.

Documentation and workshops are estimated to take an additional five days per phase.

The algorithms will be delivered in nine phases, according to the current number of member companies (as of September 2015). Each phase includes four months of development. Phase 4 was delivered in July 2015. Phase 5 is currently under way and will be delivered in October 2015.

The scoping process for Phases 6–8 will be finished in October to early November 2015.

REFERENCES

1. Ackermann, M. & Strimmer, K. (2009). A general modular framework for gene set enrichment analysis. BMC Bioinformatics 10, 47.

2. Stark, C., Breitkreutz, B.J., Reguly, T., Boucher, L., Breitkreutz, A. & Tyers, M. (2006). BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 34, D535–539.

3. Hermjakob, H., Montecchi-Palazzi, L., Lewington, C., Mudali, S., Kerrien, S., Orchard, S., Vingron, M., Roechert, B., Roepstorff, P., Valencia, A., et al. (2004). IntAct: an open source molecular interaction database. Nucleic Acids Res. 32, D452–455.

4. Lee, I., Blom, U.M., Wang, P.I., Shim, J.E. & Marcotte, E.M. (2011). Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome Res. 21, 1109–1121.

5. McDowall, M.D., Scott, M.S. & Barton, G.J. (2009). PIPs: human protein-protein interaction prediction database. Nucleic Acids Res. 37, D651–D656.

6. Schaefer, M.H., Fontaine, J.F., Vinayagam, A., Porras, P., Wanker, E.E. & Andrade-Navarro, M.A. (2012). HIPPIE: Integrating protein interaction networks with experiment based quality scores. PLoS ONE 7.

7. Von Mering, C., Jensen, L.J., Snel, B., Hooper, S.D., Krupp, M., Foglierini, M., Jouffre, N., Huynen, M.A. & Bork, P. (2005). STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res. 33, D433–437.

8. Franceschini, A., Szklarczyk, D., Frankild, S., Kuhn, M., Simonovic, M., Roth, A., Lin, J., Minguez, P., Bork, P., von Mering, C., et al. (2013). STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res. 41, D808–815.

9. Markowetz, F. & Spang, R. (2007). Inferring cellular networks–a review. BMC Bioinformatics 8 Suppl 6, S5.

10. Lee, W.-P. & Tzou, W.-S. (2009). Computational methods for discovering gene networks from expression data. Brief. Bioinformatics 10, 408–423.

11. Margolin, A.A. & Califano, A. (2007). Theory and limitations of genetic network inference from microarray data. Ann. N. Y. Acad. Sci. 1115, 51–72.

12. Bansal, M., Belcastro, V., Ambesi-Impiombato, A. & di Bernardo, D. (2007). How to infer gene networks from expression profiles. Mol. Syst. Biol. 3, 78.

13. Bureeva, S., Zvereva, S., Romanov, V. & Serebryiskaya, T. (2009). Manual annotation of protein interactions. Methods in molecular biology (Clifton, N.J.) 563, 75–95.

14. Margolin, A.A., Nemenman, I., Basso, K., Wiggins, C., Stolovitzky, G., Dalla Favera, R. & Califano, A. (2006). ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 7 Suppl 1, S7.

15. Küffner, R., Petri, T., Tavakkolkhah, P., Windhager, L. & Zimmer, R. (2012). Inferring gene regulatory networks by ANOVA. Bioinformatics 28, 1376–1382.

16. Marbach, D., Costello, J.C., Küffner, R., Vega, N.M., Prill, R.J., Camacho, D.M., Allison, K.R., Aderhold, A., Allison, K.R., Bonneau, R., et al. (2012). Wisdom of crowds for robust gene network inference. Nature Methods 9, 796–804.

17. Nitsch, D., Gonçalves, J.P., Ojeda, F., de Moor, B. & Moreau, Y. (2010). Candidate gene prioritization by network analysis of differential expression using machine learning approaches. BMC Bioinformatics 11, 460.

18. Schwikowski, B., Uetz, P. & Fields, S. (2000). A network of protein-protein interactions in yeast. Nature biotechnology 18, 1257–1261.

19. Hsu, C.-L., Huang, Y.-H., Hsu, C.-T. & Yang, U.-C. (2011). Prioritizing disease candidate genes by a gene interconnectedness-based approach. BMC Genomics 12 Suppl 3, S25.

20. Dezso, Z., Nikolsky, Y., Nikolskaya, T., Miller, J., Cherba, D., Webb, C. & Bugrim, A. (2009). Identifying disease-specific genes based on their topological significance in protein networks. BMC Systems Biology 3, 36.

21. Chindelevitch, L., Ziemek, D., Enayetallah, A., Randhawa, R., Sidders, B., Brockel, C. & Huang, E.S. (2012). Causal reasoning on biological networks: interpreting transcriptional changes. Bioinformatics (Oxford, England) 28, 1114–1121.

22. Jaeger, S., Min, J., Nigsch, F., Camargo, M., Hutz, J., Cornett, A., Cleaver, S., Buckler, A. & Jenkins, J.L. (2014). Causal Network Models for Predicting Compound Targets and Driving Pathways in Cancer. Journal of Biomolecular Screening 19, 791–802.

23. Lamb, J., Crawford, E.D., Peck, D., Modell, J.W., Blat, I.C., Wrobel, M.J., Lerner, J., Brunet, J.-P., Subramanian, A., Ross, K.N., et al. (2006). The Connectivity Map: Using Gene-Expression Signatures to Connect Small Molecules, Genes, and Disease. Science 313, 1929–1935.

24. Vanunu, O., Magger, O., Ruppin, E., Shlomi, T. & Sharan, R. (2010). Associating genes and protein complexes with disease via network propagation. PLoS Computational Biology 6, e1000641.

25. Köhler, S., Bauer, S., Horn, D. & Robinson, P.N. (2008). Walking the interactome for prioritization of candidate disease genes. American journal of human genetics 82, 949–958.

26. Chen, J., Aronow, B.J. & Jegga, A.G. (2009). Disease candidate gene identification and prioritization using protein interaction networks. BMC Bioinformatics 10, 73.

27. Ideker, T., Ozier, O., Schwikowski, B. & Siegel, A.F. (2002). Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics (Oxford, England) 18 Suppl 1, S233–S240.

28. Chuang, H.-Y., Lee, E., Liu, Y.-T., Lee, D. & Ideker, T. (2007). Network-based classification of breast cancer metastasis. Molecular systems biology 3.

29. Chuang, H.-Y., Rassenti, L., Salcedo, M., Licon, K., Kohlmann, A., Haferlach, T., Foà, R., Ideker, T. & Kipps, T.J. (2012). Subnetwork-based analysis of chronic lymphocytic leukemia identifies pathways that associate with disease progression. Blood 120, 2639–2649.

30. Ulitsky, I., Krishnamurthy, A., Karp, R.M. & Shamir, R. (2010). DEGAS: de novo discovery of dysregulated pathways in human diseases. PLoS ONE 5, e13367.

31. Rajagopalan, D. & Agarwal, P. (2005). Inferring pathways from gene lists using a literature-derived network of biological relationships. Bioinformatics 21, 788–793.

32. Hendrix, W., Rocha, A.M., Padmanabhan, K., Choudhary, A., Scott, K., Mihelcic, J.R. & Samatova, N.F. (2011). DENSE: efficient and prior knowledge-driven discovery of phenotype-associated protein functional modules. BMC Syst Biol 5, 172.

33. Gaire, R.K., Smith, L., Humbert, P., Bailey, J., Stuckey, P.J. & Haviv, I. (2013). Discovery and analysis of consistent active sub-networks in cancers. BMC Bioinformatics 14 Suppl 2, S7.

34. Bader, G.D. & Hogue, C.W.V. (2003). An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4, 2.

35. Vandin, F., Upfal, E. & Raphael, B.J. (2011). Algorithms for detecting significantly mutated pathways in cancer. Journal of computational biology : a journal of computational molecular cell biology 18, 507–522.

36. Vandin, F., Clay, P., Upfal, E. & Raphael, B.J. (2012). Discovery of mutated subnetworks associated with clinical data in cancer. Pacific Symposium on Biocomputing, 55–66.

37. Tarca, A.L., Draghici, S., Khatri, P., Hassan, S.S., Mittal, P., Kim, J.-S., Kim, C.J., Kusanovic, J.P. & Romero, R. (2009). A novel signaling pathway impact analysis. Bioinformatics (Oxford, England) 25, 75–82.

38. Su, J., Yoon, B.-J. & Dougherty, E.R. (2009). Accurate and reliable cancer classification based on probabilistic inference of pathway activity. PLoS ONE 4, e8161.

39. Hofree, M., Shen, J.P., Carter, H., Gross, A. & Ideker, T. (2013). Network-based stratification of tumor mutations. Nat. Methods 10, 1108–1115.

40. Liu, Y., Gu, Q., Hou, J.P., Han, J. & Ma, J. (2014). A network-assisted co-clustering algorithm to discover cancer subtypes based on gene expression. BMC Bioinformatics 15, 37.

41. Mitra, K., Carvunis, A.-R., Ramesh, S.K. & Ideker, T. (2013). Integrative approaches for finding modular structure in biological networks. Nat Rev Genet 14, 719–732.

42. Paull, E.O., Carlin, D.E., Niepel, M., Sorger, P.K., Haussler, D. & Stuart, J.M. (2013). Discovering causal pathways linking genomic events to transcriptional states using Tied Diffusion Through Interacting Events (TieDIE). Bioinformatics 29, 2757–2764.

43. Suthram, S., Beyer, A., Karp, R.M., Eldar, Y. & Ideker, T. (2008). eQED: an efficient method for interpreting eQTL associations using protein networks. Mol. Syst. Biol. 4, 162.

44. Huang, S.S.C. & Fraenkel, E. (2009). Integrating proteomic, transcriptional, and interactome data reveals hidden components of signaling and regulatory networks. Sci Signal 2, ra40.

45. Vaske, C.J., Benz, S.C., Sanborn, J.Z., Earl, D., Szeto, C., Zhu, J., Haussler, D. & Stuart, J.M. (2010). Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics 26, i237–245.

46. Ng, S., Collisson, E.A., Sokolov, A., Goldstein, T., Gonzalez-Perez, A., Lopez-Bigas, N., Benz, C., Haussler, D. & Stuart, J.M. (2012). PARADIGM-SHIFT predicts the function of mutations in multiple cancers using pathway impact analysis. Bioinformatics 28, i640–i646.

47. Bandyopadhyay, S., Mehta, M., Kuo, D., Sung, M.-K., Chuang, R., Jaehnig, E.J., Bodenmiller, B., Licon, K., Copeland, W., Shales, M., et al. (2010). Rewiring of genetic networks in response to DNA damage. Science 330, 1385–1389.

48. Sanfeliu, A. & Fu, K.-S. (1983). A distance measure between attributed relational graphs for pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics SMC-13, 353–362.

49. Navlakha, S. & Kingsford, C. (2010). The power of protein interaction networks for associating genes with diseases. Bioinformatics 26, 1057–1063.
