technical report - cs.umn.edu · sources including functional annotations, gene expressions and...

Prioritizing Disease Genes with Label Propagation on a Heterogeneous Network

Technical Report

Department of Computer Science

and Engineering

University of Minnesota

4-192 EECS Building

200 Union Street SE

Minneapolis, MN 55455-0159 USA

TR 09-022

Prioritizing Disease Genes with Label Propagation on a

Heterogeneous Network

Taehyun Hwang and Rui Kuang

September 08, 2009

1

Prioritizing Disease Genes with Label Propagation on aHeterogeneous NetworkTaeHyun Hwang and Rui KuangDepartment of Computer Science and Engineering, University of Minnesota Twin Cities, MN, USA! E-mail: [email protected]

Abstract

Evidences from recent studies suggest that disease-causative genes can be identified more accurately fromthe modular structures in a heterogeneous network that integrates a disease phenotype similarity subnet-work, a gene-gene interaction subnetwork and a phenotype-gene association subnetwork. However, it is achallenging machine learning problem to explore a heterogeneous network comprising several subnetworkssince each subnetwork contains its own cluster structures that need to be explored independently. Weintroduce a general regularization framework and an intuitive and e!cient algorithm called MINProp forpropagating information between an arbitrary number of subnetworks in a heterogeneous network. Ouralgorithm performs label propagation on each individual subnetwork with the current label informationderived from all the subnetworks, and repeats this step until convergence to the global optimal solutionof the convex objective function in the regularization framework. In simulations, we show that MINPropcan significantly improve the ranking task by removing the biases introduced by the discrepancy amongthe subnetworks. We then tested MINProp for disease gene prioritization on a large-scale heterogeneousnetwork containing 8919 genes and 5080 OMIM phenotypes. MINProp achieved competitive or betteroverall gene ranking performance than CIPHER and random walk with restart, two best-performingmethods for disease gene prioritization, in both leave-one-out cross-validation and the case study of dis-covering new disease phenotype-gene associations added to OMIM after May 2007. We also validatedthat MINProp can specifically improve the ranking of those disease genes in the dense modules of thegene-gene interaction network. Furthermore, MINProp revealed interesting global modular structure ofhuman disease phenotype-gene associations and new associations that are only reported recently.

Author Summary

Understanding causative relations between genes and disease phenotypes has been a long standing chal-lenge in genomics research. The recent ’omics projects o"er the opportunity for a genome-wide studyof the associations with the large scale human protein-protein interaction network, phenotype similaritynetwork and their association network. Integrating the heterogenous genomic and phenotypic data andexploring the modularities among the genomic and phenotypic objects becomes the central computa-tional problem. The key challenge is two-fold: how to integrate heterogenous data and how to utilizethe modular structures. We propose to integrate the data as a large heterogenous network and intro-duce an algorithm to explore the modules in the heterogenous network. The proposed algorithm is amathematically principled and yet intuitive label propagation method. Extensive empirical experimentssupport that the algorithm can improve the accuracy of disease phenotype-gene association predictionover currently available methods. The algorithm is also a general method that is applicable to similarnetwork integration problems in other research areas. The supplementary information and source codeare available at http://compbio.cs.umn.edu/MINProp/.

Introduction

With the advent of the ’omics projects, large-scale networks such as the disease phenotype-phenotypesimilarity network [1] and the human gene-gene interaction network [2] became available for computational

2

prediction of the associations between human disease phenotypes and genes. Recent studies showed thatthere are modular structures among functionally-related genes and among similar phenotypes in thenetworks [3–5], and the modular structures can be utilized to improve disease gene prioritization [6–9].Disease gene prioritization is the task of ranking the candidate disease genes underlying each diseasephenotype to prioritize the genes for further experimental validation. Several computational models havebeen developed to integrate heterogeneous genomic data for revealing the molecular causation mechanismof human genetic diseases [3, 10–12]. ENDEAVOUR [10] is a disease gene prioritization method, whichranks the candidate genes by an order statistics that combines the rankings based on multiple datasources including functional annotations, gene expressions and literature reviews. CIPHER [12] is amore recent method for disease gene prioritization in a heterogeneous network integrating the diseasephenotype-phenotype similarity network, the gene-gene interaction network and the disease phenotype-gene network (Fig. 1A). To predict the disease genes for a query phenotype, CIPHER ranks the candidategenes based on the correlation between their gene-connectivity (number of common gene neighbors) withother phenotypes and the phenotype similarity profile of the query phenotype. CIPHER only focuses onthe second-order connections instead of the global associations among the genes and the phenotypes inthe heterogeneous network. A method based on random walk with restart was introduced to prioritizedisease genes by propagating information on a protein-protein interaction network by Khler et al. [13].This method is not customized for propagating information in the heterogeneous network, which providesa much richer collection of network information.

In this paper, we introduce a general regularization framework and an e!cient algorithm MINProp(Mutual Interaction-based Network Propagation) for propagating information between subnetworks in aheterogeneous network. Instead of being treated as part of the big network for label propagation, eachsubnetwork is explored as an independent unit. In an integrated heterogeneous network with di"erenttypes of vertices and edges in several subnetworks (Fig 1B), we define without loss of generality two typesof subnetworks in the big network: subnetworks between the same type of objects (homo-subnetwork) andbipartite subnetworks with edges connecting two types of objects (hetero-subnetwork). Simply propagat-ing the label information on the heterogenous network is not a principled method to explore the clusterstructures in the heterogenous network, since each homo-subnetwork may have its own cluster structureand each hetero-subnetwork may also have its own bicluster structures. In principle, these clusters andbiclusters should be explored independently, rather than combined together as a single structure in thecombined network. Especially, it is common that there exist biases among the subnetworks introducedby the heterogeneity in the combined network. The biases can be unbalanced sizes, di"erent noisy levelsand di"erent edge-weight scales among the subnetworks. Ignoring the biases can possibly lead to signif-icantly deteriorating performance since some of the independent cluster structures might be destroyedand became unavailable for neighborhood leveraging by label propagation.

In our regularization framework, the objective is to minimize a convex function of cost terms forsmoothness on each individual homo-subnetwork and hetero-subnetwork and the fitting to the initiallabeling. A novel label propagation algorithm MINProp is then introduced to compute the global optimalsolution to the objective function. Our algorithm performs label propagations on each individual homo-network with the current label information derived from the hetero-networks at each step and repeatsthe step until convergence. The independent network propagations at each step explores the clustersin each homo-subnetwork but uses the label information derived from the hetero-networks to capturemutual interactions (bicluster structures) between each two types of vertices. The MINProp algorithm isessentially an alternating optimization technique [14] for solving convex optimization problems, in whicha subset of variables are fixed and optimization is performed on the remaining variables in each iteration.The MINProp algorithm will finally converge on each individual network to the global optimal solutionto the convex objective function in the regularization framework.

Several label propagation algorithms have been proposed to integrate multiple homogenous networkswith the same type of vertices for di"erent learning tasks in bioinformatics [15, 16]. These methods first

3

assign weights to multiple homogenous networks to combine them, and then perform label propagation inthe combined homogeneous network for protein function prediction. Similar strategies were also appliedto combining kernel matrices for SVM classification [17,18]. These algorithms are di"erent optimizationtechniques such as QCQP and SDP for learning a weighted linear combination of multiple homogeneousnetworks with the same type of vertices. Thus, these methods are not directly applicable to integratingheterogeneous networks with di"erent types of vertices. There are also several attempts to handlinglabel propagation on heterogeneous networks in text classification [19–21]. The method by Huang etal. [19] integrates two types of objects, authors and papers, with a marginalized random walk. Dinget al. [20] generalized the method to combine two coupled random walks in the bipartite network withiterations. Ding et al. [21] introduced a more comprehensive regularization framework and an incrementalapproximate optimization technique to integrate citation, authors and venue information for documentranking. These methods were only applied to text classification problems. Compared with these methods,our method provides a di"erent general convex formulation of a unified learning framework to e"ectivelyintegrate an arbitrary number of heterogeneous units in a heterogeneous network in the context of labelpropagation. MINProp is an e!cient algorithm to compute the unique global optimal solution of theregularization function. The algorithmic form of MINProp provides an interesting interpretation of thelearning framework as utilizing mutual interactions between homo-subnetworks as initialization for eachround of label propagations.

Methods

In this section, we first review label propagation algorithms on a single network and introduce thenotations for graph-based learning on a heterogeneous network. We then describe the MINProp algorithmand develop its regularization framework.

Label Propagation on a Single Network

Various graph-based algorithms were introduced for label propagation on a similarity network [22–25].These algorithms simply propagate labels among the neighbors in the network. The propagation repeatsuntil convergence. The set of final label confidence scores on the vertices is the optimal solution tominimizing the quadratic criteria of the semi-supervised learning problem. In this paper, we will baseour algorithm on the iterative algorithm and the regularization framework proposed in Zhou et al. [25],although our algorithm can also be easily adopted to other frameworks.

Given an undirected graph G = (V,E) with vertices V and edges E, its similarity matrix W and aninitial labeling y, Zhou et al. [25] proposed the following label propagation algorithm,

1. Normalize W by computing S = D! 12 !W !D! 1

2 , where D is the diagonal matrix with Dii =!

j Wij .

2. Choose a di"usion parameter ! and perform propagation until convergency (t: time step):

f t = (1" !)y + !Sf t!1. (1)

3. The sequence f t converges to its limit f" and f" gives predictions on the unlabeled vertices in V .

This algorithm propagates the label information of every vertex to its neighbors in step 2. The propagationprocess will leverage the label scores of the vertices in a densely connected neighborhood. This algorithmoptimizes the following objective function,

#(f) = fT (I " S)f + µ # f " y #2, (2)

4

where µ = 1!!! . The first term is the smoothness constraint indicating that a good classification function

should assign similar label/activation value to strongly connected vertex pairs. The second term is thefitting constraint indicating that a good classification function should keep the new label assignmentconsistent with the initial labeling.

Heterogeneous Network

Given a heterogeneous graph G = (V,E) with k homo-subnetworks and k(k " 1)/2 hetero-subnetworks(see Fig 1B), each homo-subnetwork is defined as G(i) = (V (i), E(i)) and each hetero-subnetwork is definedas G(i,j) = (V (i)

"V (j), E(i,j)). Here, each E(i) $ V (i)%V (i) is the set of edges between vertices in homo-

subnetwork G(i) and each E(i,j) $ V (i) % V (j) is the set of edges connecting vertices in G(i) and G(j).Let W (i) denote the weight matrix of the homo-subnetwork G(i) and W (i,j) denote the weight matrix ofthe hetero-subnetwork G(i,j). We define the normalization of W (i) as S(i) = (D(i))! 1

2 W (i)(D(i))! 12 and

W (i,j) as S(i,j) = (D1(i,j))! 12 W (i,j)(D2(i,j))! 1

2 , where D(i), D1(i,j) and D2(i,j) are the diagonal matriceswith D(i)

ll =!

j W (i)lj , D1ll =

!p W (i,j)

lp and D2ll =!

p W (i,j)pl .

Next we define graph laplacians on the subnetworks in G. We introduce the graph laplacians inthe normalized form, since the unnormalized version is straightforward to derive from the normalizedversion [25]. Let the normalized graph laplacian matrix of homo-subnetwork G(i) be $(i) = I " S(i),where I is identity matrix. The normalized graph laplacian matrix %(i,j) of hetero-subnetwork G(i,j) is

defined as %(i,j) = I "#

0 S(i,j)

(S(i,j))T 0

$. Label propagation associated with the graph laplacian of

a single network through a regularization framework [26] ignores the di"erence among the subnetworksin a heterogeneous network. In a complex heterogeneous network, each subnetwork has a specific graphlaplacian that needs to be normalized and explored independently. Thus, a regularization framework onthe single network is not appropriate for label propagation on a heterogeneous network.

MINProp Algorithm

To handle label propagation on a complex heterogeneous network, MINProp sequentially performs net-work propagations on each individual homo-subnetwork with the current label information derived fromthe other homo-subnetworks and repeats this step until convergence. The MINProp algorithm performslabel propagation on the ith homo-subnetwork G(i) sequentially from i = 1 to k. The label propagationon each homo-subnetwork is the same as that in the algorithm for a single network in Equation (1), butthe initialization of the vertices in G(i) is a combination of the initial labeling of the vertices and thecurrent labeling of the vertices in the other homo-subnetworks. The labeling information on the otherhomo-subnetworks is collected as the mutual interactions through G(i,j) (i &= j), the hetero-subnetworksbetween the ith homo-subnetwork and the other homo-subnetworks. The mutual interaction informa-tion is collected as

!j #=i S(i,j)fj , where fj is the current labeling of homo-subnetwork G(j). For each

vertex in G(i), the label is initialized as the addition of its initial label score and the label of its im-mediate neighbors in the other homo-subnetworks. The introduction of the labeling information thoughthe hetero-subnetworks can capture the bicluster structures between the vertices in each pair of thesubnetworks.

The complete MINProp algorithm is described in Algorithm 2. The normalized weighted graphs ofall homo-subnetworks and hetero-subnetworks (S(i) and S(i,j)) are pre-computed as inputs. There arethree loops in the main body of the MINProp algorithm. The outer do-while-loop between line 2 and line13 checks the convergence on all the k homo-subnetworks. The convergence is defined as the sum of thesquared score change after one iteration is less than a threshold ". The second outer for-loop betweenline 4 and line 12 sequentially goes through each homo-subnetwork. The inner do-while-loop between line7 and line 10 performs label propagation on each homo-subnetwork, similar to the algorithm in Zhou et

5

al. [25]. In line 6, the mutual information is collected from the vertices in the other homo-subnetworks.The iterative propagation step at line 9 can be rewritten as

f ti = (1" !i)(

1" k!i

1" !iyi +

!i

1" !i

%

j

S(i,j)fj) + !iS(i)f t!1

i . (3)

Equation (3) is equivalent to the propagation step in equation (1), if y = 1!k!i1!!i

yi + !i1!!i

!j S(i,j)fj in

equation (1). Thus, this step can be thought of as label propagation on a single network with an enrichedinitialization from the hetero-subnetworks and the proof of convergence is identical to the proof in Zhouet al. [25]. Finally, the sequence f t

i converges to its limit f"i and f"i gives the class labels/activation valueson the vertices in all subnetworks.

Algorithm 1 MINPropInputk: number of homo-subnetworks": convergence thresholdy1, y2, ..., yk: vectors of initial label values on the vertices!1, !2, ...,!k: di"usion parametersS(1), S(2), ..., S(k): weight matrices of homo-subnetworksS(1,2), ..., S(k!1,k): weight matrices of hetero-subnetworksOutputf1, f2, ..., fk: vectors of final label values on the vertices

1: fi = 0 for i = 1...k;2: do3: fold

i = fi for i = 1...k;4: for i = 1...k5: t = 0, f0

i = 0;6: y$ = 1!k!i

1!!iyi + !i

1!!i

!j #=i S(i,j)fj ;

7: do8: t = t + 1;9: f t

i = (1" !i)y$ + !iS(i)f t!1i ;

10: while(# f ti " f t!1

i #> ");11: fi = f t

i ;12: end for13: while ('i s.t. # fi " fold

i #> ");14: return f1, f2, ..., fk;

The runtime for calculating the initial label values on homo-subnetwork G(i) = (V (i), E(i)) (line 6) isO(|V (i)|

!j #=i |V (j)|). The runtime of the inner do-while-loop between line 7 and 10 is O(ti|V (i)|2), where

ti is the number of iterations to reach convergence. These two steps will repeat on each homo-subnetwork(line 4 to line 12). Thus, the time complexity is O(

!i(|V (i)|

!j #=i |V (j)| + ti|V (i)|2)) for finishing one

round of propagations on each homo-subnetwork. A more e!cient but less intuitive implementation ofMINProp (see Text S1) can pre-compute (I " !iS(i))!1 and the inner do-while-loop between line 7 and10 only needs to compute (1" !i)(I " !iS(i))!1y$ [25]. Let t be the total number of iterations to reachthe convergence. The total time complexity of the e!cient implementation is O(t|V |2 +

!i ti|V (i)|2).

The actual running time of MINProp crucially depends on the tis, the convergence rate of propagationson each homo-subnetwork and the t, the number of iterations going through all the homo-subnetworks.The convergence rate of label propagation on each homo-subnetwork closely relates to the property

6

of the spectrum of the graph [26]. In practice, it converges within tens of iterations even on largenetworks [27–30]. The iterations going through the homo-subnetworks needed for convergence can beestimated by the theory of alternating optimization [14]. In the next section, we will show that theMINProp algorithm is essentially an alternating optimization algorithm that e!ciently calculates theclosed-from solution of a convex objective function.

Regularization Framework

A natural regularization framework for learning on a heterogeneous network G = (V,E) is given asfollows,

#(f) =k%

i=1

(fTi $(i)fi + µi # fi " yi #2) +

12

k!1%

i=1

k%

j=i+1

µij [fTi fT

j ]%(i,j)

#fi

fj

$, (4)

where f ( R|V | is the label variable, and µi and µij are positive constants that balance the cost termsin the objective function. The cost term fT

i $(i)fi is the smoothness constrain on homo-subnetworkG(i) = (V (i), E(i)) that enforces a consistent labeling of the strongly connected vertices in V (i). The costterm # fi"yi #2 is a fitting term which keeps the final label values consistent with the initial labels. If weput the above two cost terms together, each fT

i $(i)fi+ # fi " yi #2 is exactly the objective function forlearning on G(i) identical to the objective function for learning on a single network in equation (2). The

last additional cost term [fTi fT

j ]%(i,j)

#fi

fj

$on the second line is the same smoothness constraint on the

hetero-subnetwork G(i,j) = (V (i)"

V (j), E(i,j)) that enforces a consistent labeling between the stronglyconnected vertex pairs in V (i)

"V (j). The normalized graph laplacian %(i,j) is defined on the bipartite

graph G(i,j) [29]. For simplicity of analysis and implementation, we set all the µij to be 1, assuming thatthe homo-subnetworks and hetero-subnetworks are equally informative.

Proposition 1. #(f) is strictly convex.

Proof. Since #(f) is a non-negative-weighted sum of convex terms including graph laplacians and sumsof squares, #(f) is clearly convex. If we take the second derivative of #(f), we obtain a Hessian matrixthat is also a non-negative-weighted combination of the convex terms plus a diagonal matrix with positivediagonal entries µ1, µ2, ..., µk. Therefore, the Hessian of #(f) is positive definite. Hence, #(f) is strictlyconvex.

Lemma 1. The optimal solution to the alternating optimization step on each fi in the objective function#(f) is f"i = (1" !i)(Ii " !iS(i))!1( 1!k!i

1!!iyi + !i

1!!i

!j S(i,j)fj).

Proof. By Proposition 1, #(f) is strictly convex. Thus, it can be minimized with alternating optimization[14]. Specifically, for each fi, we fix the fj for all j ( {j|1 ) j ) k, j &= i} and then di"erentiate #(f)with respect to fi to compute the closed-form solution f"i for minimizing #(f). The detailed derivationis given in the Text S2. If we take partial derivative of #(f) with respect to fi and set it to zero, we canobtain the closed-from solution f"i ,

f"i = (1" !i)(Ii " !iS(i))!1(

1" k!i

1" !iyi +

!i

1" !i

%

j

S(i,j)fj).

where !i = 1/(k+µi). This concludes the proof. Note that Ii"!iS(i) is positive definite since !i ( (0, 1)and the largest eigenvalue of S(i) is 1.

7

Proposition 2. The MINProp algorithm minimizes the objective function #(f) of the regularizationframework.Proof. We have showed that the iteration step between line 4 and line 11 computes the propagationoperation defined by equation (3), which is identical to equation (1). Follow the proof by Zhou et al. [25],we can show that the sequence f (t)

i converges to f"i = (1"!i)(Ii"!iS(i))!1( 1!k!i1!!i

yi + !i1!!i

!j S(i,j)fj).

By Lemma 2, we conclude that the iteration step computes the optimal solution to the alternat-ing optimization of each fi in minimizing #(f). The two outer loops in Algorithm 2 fixes fj for allj ( {j|1 ) j ) k, j &= i} to find the optimal fi sequentially and repeat until converge. Algorithm 2exactly performs alternating optimization for minimizing #(f). Thus, MINProp essentially minimizesthe objective function of the regularization framework.

Results

Simulations

To show that MINProp can remove biases introduced from network heterogeneity by exploring the inde-pendent cluster structures in each subnetwork, we compared MINProp and Random Walk with restart [25]in artificial heterogeneous networks with various settings. We first generated heterogeneous networks withthree homo-subnetworks (A, B and C) and three hetero-subnetworks (A-B, A-C and B-C) as plottedin Fig 2A. Each homo-subnetwork has three densely connected clusters (e.g. A1, A2 and A3 in homo-subnetwork A). With probability 0.8, two vertices in the same cluster are connected. Globally, theclusters numbered the same in the three homo-subnetworks are assumed to have the true associations,i.e. there are three global clusters A1

"B1

"C1, A2

"B2

"C2 and A3

"B3

"C3. The vertices in the

same global cluster but di"erent homo-subneworks are connected with probability 0.7. To mimic thenoisy nature of real networks, random edges connecting vertices in di"erent global clusters are generatedwith probability 0.3.

In the simulation, the task is to retrieve the true associations defined by hetero-subnetwork A-B.Specifically, we use each vertex in the homo-subnetwork A as a query and remove the direct links betweenthe query vertex and the vertices in homo-subnetwork B to test whether a method can rank the verticesin the cluster associated with the query vertex in homo-subnetwork B high. For example, if we querywith a vertex in cluster A1, a good method should rank the vertices in cluster B1 above the verticesin B2 and B3. Both algorithms were tested on 50 randomly generated heterogeneous networks. In theexperiment on each heterogeneous network, we perform leave-one-out cross-validation. We calculated theAUC scores from the ranking of the vertices and reported the average AUC scores across all the queries.

Biased edge-weight scales

We fixed the size of the three homo-subnetworks, and varied the edge weights between 1 to 9 in the homo-subnetwork C while assigning constant weight 1 to all other edges. As shown in Fig 2B, when all thesubnetworks have the same edge weight 1, MINProp and Random Walk perform similarly well. However,when larger weighting scales are introduced into homo-subnetwork C, the performance of Random Walkdeteriorates quickly. The plot indicates that label propagation on a network with heterogeneous edges ofdi"erent weighting scales can possibly destroy the cluster structures in the subnetworks. But MINPropcan properly avoid the scaling problem by performing independent label propagation on each homo-subnetwork.

Unbalanced subnetwork sizes

In this experiment, all the edge weights are set to 1 and the size of subnetwork A and B are set to30. We started with testing the algorithms with three homo-subnetworks of the same size, and then

8

gradually increased the size of homo-subnetwork C. Interestingly, it is clear in Fig 2C, as the size ofhomo-subnetwork C grows, the performance of Random Walk gets worse while MINProp performs stablyand robustly well for all the sizes. The result indicates that if the heterogeneous network consists ofunbalanced subnetworks, the cluster structures in the larger subnetworks are dominating. Thus, thecluster structure in the smaller subnetworks can diminish.

Noisy subnetworks

We fixed the edge weights and sizes of the three homo-subnetworks, and then introduced di"erent per-centage of noisy edges in the hetero-subnetworks A-C and B-C. The noisy edges are introduced as therandom connections between the vertices in subnetwork C and the vertices in subnetwork A and B.Fig 2D shows that when all the subnetworks are equally informative, the performance di"erence betweenthe two algorithms is small. However, MINProp clearly outperforms Random Walk when the percentageof noisy edges is high. The plot shows that MINProp is more robust in handling noisy subnetworks thanRandom Walk.

Convergence rate and running time

To test the convergence and the scalability, we measured the convergence rate and the running timeof MINProp on artificial heterogenous networks of two homo-subnetwork and one hetero-subnetworkgenerated in the same setting with the previous simulations. The convergence is defined as the maximumchange of activation values over all the graph nodes being smaller than 1e" 9. Nine combinations of the!i parameters are tested. The running time and the number of iterations for reaching convergence arereported in Fig 3. For all the choices of the parameters, MINProp converged within 16 iterations and thenumber of iterations are similar on the heterogenous networks of di"erent sizes (Fig 3B). The runningtime almost scales quadratically in the size of the heterogenous network (Fig 3A).

Experiments of Disease Gene Prioritization

In the experiments of disease gene prioritization, the heterogenous network includes a gene-gene interac-tion subnetwork, a disease phenotype similarity subnetwork, and a phenotype-gene association subnet-work (Fig 1A). With a similar setting in Wu et al. [12], the task of disease gene prioritization is definedas querying with a disease phenotype to rank the candidate disease genes in the gene-gene interactionsubnetwork (Fig 4). MINProp was tested in both a leave-one-out cross-validation and a case study ofdiscovering new disease phenotype-gene associations added to OMIM after May 2007. We comparedMINProp with Random Walk with restart [13], and CIPHER DN (Direct Neighbors) or SP (ShortestPath) [12], two of the-state-of-the-art algorithms that also explore network information for disease geneprioritization. The performance was evaluated by AUC scores calculated on the ranking of candidatedisease genes by each method. We report the average and the pairwise win/draw/loss comparison of theAUC scores of all the queries. The pairwise win/draw/loss comparison is counting the number of timesthat a method wins over or draws a tie with or loses to another method in all the queries.

Data Preparation

The three subnetworks in the heterogeneous network were prepared. The gene-gene interaction networkwas derived from the human protein-protein interaction (PPI) network introduced by Wu et al. [12]. ThePPI network contains 34,364 binary-valued undirected interactions between 8919 human proteins (genes).The information of phenotypes and disease genes was extracted from OMIM database (Version May-2007).The disease phenotype similarity network is an undirected graph with 5080 OMIM disease phenotypevertices [31]. The edges are weighted by the pairwise quantitative measurements of the phenotypicoverlap in text and clinical synopsis of OMIM records calculated by text mining techniques [32]. The

9

phenotype-gene association network is an undirected bipartite graph with disease phenotype vertices andgene vertices. Binary edges connect 1126 disease phenotype vertices and 916 gene vertices based on theassociations in OMIM [31].

Experiment Design

The task of disease gene prioritization is, for a given query phenotype, to rank the disease genes associatedwith the phenotype among a set of control genes. The optimal performance will be ranking the diseasegenes associated with the phenotype above all the control genes. We designed two experiment settingswith di"erent sets of control genes: all other genes (including both the other disease genes and thenon-disease genes) or the non-disease genes. The non-disease genes are the genes that have not knownassociation with any disease phenotypes.

Setting 1: Uncovering associations with known disease genes

The first setting is to discover new phenotype-gene associations between the query phenotype and allthe genes including both the non-disease genes and the disease genes (Fig 4A). This experiment is acomprehensive evaluation of how well a method can discover phenotype-gene associations for a given newdisease phenotype without any known disease genes. Specifically, we want to discover novel associationsbetween a query phenotype and all the genes including the disease genes that already have associationswith the other disease phenotypes. Thus, in the leave-one-out cross-validation, we only removed thedirect link between the query phenotype and its disease genes but kept the links between its diseasegenes and other phenotypes. The disease genes of the query phenotypes were then ranked among all thegenes including both disease genes and non-disease genes. The learning task is to rank the disease genesof the query phenotype as high as possible above all other genes in the rank list.

Setting 2: Discovering associations with new disease susceptibility genes

The second setting is to examine how well a method can discover the disease susceptibility genes of eachquery phenotype from the non-disease genes (Fig 4B). The 8003 non-disease genes among the 8919 geneswere selected as the control genes. In leave-one-out cross-validation of querying with a disease phenotype,we removed the links between its disease genes and all the phenotypes including the query phenotype.This is equivalent to adding the disease genes of the query phenotype as unknown disease genes to the8003 non-disease genes without any selection bias. The learning task is to rank the disease genes of thequery phenotypes as high as possible above the non-disease genes in the rank list.

In the first setting, we focus on finding missing phenotype-gene associations between phenotypes andgenes. The target genes are allowed to keep their associations with other phenotypes. Therefore, theywill compete against other disease genes in the rank list. In the second setting, we focus more on findingnew disease susceptibility genes from non-disease genes. Thus, the target genes are not linked with anyphenotypes, and the question is whether we can discover this type of hidden disease genes (the diseasegenes of the query phenotype) without knowing any associations between the target genes and all diseasephenotypes. The purpose of introducing the two di"erent settings is to give a comprehensive evaluationof the methods in scenarios with di"erent network connectivities on the target genes.

Performance of Ranking Disease Genes in Leave-one-out Cross-validation

The 5080 disease phenotypes and the 1126 phenotype-gene associations extracted from OMIM versionMay-2007 were tested in leave-one-out cross-validation under the two settings. We performed leave-one-out cross validation by holding-out one query phenotype for testing at a time. The ranking performancesof all the methods in the two settings are measured by AUC scores calculated based on the ranking of

10

the true disease genes among the control genes determined by each method. One limitation of CIPHERDN is that only genes with at least one disease gene in its direct neighbors can be ranked with the othergenes. Thus, for some of the query phenotypes, its disease genes cannot be ranked by CIPHER DN inleave-one-out cross-validation. After filtering of the query phenotypes with causative disease genes thathave no direct neighbor to the other disease genes, 858 disease phenotypes were left for evaluation in thecomparison with CIPHER DN.

The average AUCs of gene ranking across all the queries by MINProp, Random Walk with restart andCIPHER are reported in Table 1. In the first setting, MINProp outperformed CIPHER DN by 12.5%,CIPHER SP by 7.1% and Random Walk by only 0.8%. In the second setting, MINProp outperformedRandom Walk by 8% and CIPHER DN by 7.3%, and achieved a tie with CIPHER SP. The pairwisecomparisons of the AUC scores in Table 1 suggest that MINProp improved the ranking of more querycases compared with the other methods, except the comparison with CIPHER SP in the first setting. Amore detailed examination of the individual query cases is given in Fig 5. Interestingly, MINProp canwin over Random Walk in about 66% and 93% queries in the two experiments but only by a relativesmall margin — most of the dots in the scatter plot in Fig 5A and Fig 5D are just below the diagonalline. Since both methods are exploring the global structures in the network, it is not supervising that theresults are strongly correlated. But clearly, by exploring the independent structures in the subnetworks,MINProp is able to improve most of the predictions. In the comparison with CIPHER DN in Fig 5Band Fig 5E, MINProp and CIPHER DN produce quite di"erent results in most of the queries. MINPropimproved on many hard cases, on which CIPHER DN and SP performed poorly. This suggests that formany cases, it is not enough to just check the second-order neighbors of the genes, and it is importantto explore the global structures.

MINProp achieved the best overall performance in the experiments. The explanation of the tiebetween MINProp and CIPHER SP in the second setting needs more analysis. CIPHER SP also exploresthe global structure in the gene-gene interaction network since the gene-gene connection was evaluatedby shortest paths, which measure remote interaction between the genes in the network. In the secondsetting, because no target genes are linked with the phenotypes, the cluster structures in the phenotypesimilarity network only have small influence. Thus, the ranking of the disease genes relies more on thebi-clusters between the genes and the phenotypes. In this case, CIPHER SP can possibly perform betteron the disease genes that are directly connected to other disease genes, while MINProp might dilutethis direct information with neighborhood averaging. To show that MINProp and CIPHER SP canbe complementary to each other, we further combined the gene ranking produced with MINProp andCIPHER SP by averaging the ranks of each gene in the two lists. A 3% improvement is observed in thehybrid case (last column in Table 4).

Case Study: Discovering New OMIM Associations Added after May 2007

To test the real performance of discovering new disease phenotype-gene associations, a case study wasalso conducted to predict the associations added into OMIM between May, 2007 and July, 2009. Therecent disease phenotype-gene associations from OMIM in July, 2009 were downloaded and compared tothe associations in the May-2007 version. There are 462 new associations involving 374 new phenotypesin the July-2009 version. Out of the 462 new associations, 95 associations are between the known diseasegenes and the phenotypes, and the other 367 are between the phenotypes and newly discovered diseasegenes, which were among the non-disease genes in the May-2007 version. We performed experiments inthe two settings with di"erent sets of control genes: non-disease genes or all genes including the otherdisease genes in the May-2007 version, and measured AUC scores based on the ranking of the true diseasegenes among the control genes. Note that we only compared the performance of MINProp with RandomWalk with restart and CIPHER-SP because CIPHER DN is not capable of ranking the genes at genome-wide scale experiments as discussed in the previous section. Twenty phenotypes are associated withnew disease genes in both settings (among either previously known disease genes or non-disease genes).

11

Thus, experiments were performed in both settings for the 20 phenotypes. The average AUCs of generanking across all queries by MINProp, Random Walk and CIPHER are reported in Table 2. In the firstsetting, MINProp outperformed CIPHER SP by 28.2% and Random Walk by 5.7%. In the second setting,MINProp outperformed CIPHER SP by 4.9% and Random Walk by 2.0%. The querywise comparisonsof the AUC scores in Table 2 suggest that MINProp improved the ranking of more query cases thanthe other methods in both experiments. Fig 6 clearly shows that MINProp outperformed both RandomWalk and CIPHER SP. Specifically, MINProp wins over Random Walk and CIPHER SP in about 70%and 100% of the queries in the first setting. In the second setting, MINProp wins over Random Walk inabout 68% of the cases by a small margin, and CIPHER SP in about 70% of the cases including manyhard queries, on which CIPHER SP performed poorly.

MINProp ranked most of the new disease genes of the phenotypes of rare cancers, syndromes anddisorders higher than Random Walk and CIPHER SP. The ranking of the new disease genes of thesephenotypes are listed in Table 3. For example, pheochromocytoma (MIM:171300), caused by mutationsin genes VHL, GDNF, K1F1B and RET, is a rare neuroendocrine tumor with missing diagnosis results indevastating cardiovascular complications or death [33]. MINProp ranked VHL at 1st, K1F1B at 228th,and GDNF at 16th in the ranking among the control genes. Some other successful examples are NF2 infamilial meningioma (MIM:607174), TP53 in adrenocortical cacinoma (MIM:202300), and WAS in severeneutro neutropenia (MIM:300299). Random Walk and CIPHER SP failed to rank these genes at the topof the rank lists.

Exploring the Modularity of Genes

To evaluate how e"ective the methods can explore the modular structures in the gene-gene interactionnetwork, we analyzed the ranking results of the disease genes with their involvement in the modularstructures of the gene-gene interaction network in the second setting. The modularity of the disease genesof each query phenotype is measured by their clustering coe!cients in the gene-gene interaction network.Large averages of clustering coe!cients of the disease genes indicate high modularity-involvement of thedisease genes. In Table 4, we compare the ranking performance on the query phenotypes with respectto di"erent levels of gene clustering coe!cients. Clearly, MINProp outperforms the baselines on the testqueries of higher gene clustering coe!cients. The observation supports the hypothesis that the clusters ineach subnetwork should be explored independently as MINProp dose rather than as one network. SinceCIPHER DN also utilizes the modular structure of genes by counting gene neighbors, the trend of largerclustering coe!cients suggesting larger performance di"erence between MINProp and CIPHER DN isnot obvious. But overall, MINProp significantly improved disease gene ranking compared to CIPHERDN in all ranges of gene clustering coe!cients. The analysis indicates that MINProp is more capable ofe"ectively utilizing the modular structures in the gene-gene interaction network, and our framework mightbe a more robust and powerful method for the gene prioritization task on the heterogeneous network.Finally, the hybrid of MINProp and CIHPER SP can boost the performance of MINProp on the queriesof small gene clustering coe!cients to be similar to CIPHER SP.

Inferring Novel Phenotype-gene Causative Relations in Modules

To test whether MINProp can infer modules of phenotype-gene causal relations, we queried the phenotype-gene association network by all the disease phenotypes one at a time and produced a pairwise similaritymatrix between the 5080 phenotypes and 8919 genes. The matrix was then bi-clustered to discovermodules of disease phenotype-genotype associations. The bi-clustered matrix is shown in Fig 7. Weapplied Ingenuity (http://www.ingenuity.com/) to annotate the phenotype clusters and gene clusters ofenriched disease categories and Gene Ontology terms. We found many interesting associations betweenthe phenotypes and genes in the same bi-cluster module. One example is the module A in Fig 7,which contains a set of phenotypes associated with neurological diseases including Alzheimer, Parkinson

12

and Dementia diseases. Previous studies discovered genetic associations in the pathogenesis of theseneurological diseases [34]. Diabetes, which is not a neurological disease, is also included in the cluster.Interestingly, this inclusion is consistent with a recent study, which showed that neurological disease areassociated with diabetes [34]. We also identified recently discovered susceptive genes, TRAK2, MYH13,PRNP, MAPT and CHRNB2 as well as the well known APOE, of Alzheimer disease in the gene clusterassociated with the neurological diseases. These identified genes do not have associations with Alzheimerdisease in the OMIM database. These findings suggest that exploring independent cluster structures ofdisease phenotype and gene-gene networks can reveal new causal relations between disease phenotypesand genes.

Discussion

We propose MINProp, a general graph-based learning algorithm, for label propagation on heterogeneousnetworks for disease gene prioritization. We showed in the experiments that MINProp can e"ectivelyexplore independent cluster structures in each subnetwork to remove the biases introduced by the hetero-geneity of a complex network of several di"erent subnetworks. MINProp algorithm iteratively performslabel propagation (semi-supervised learning) on each homo-subnetwork independently. Thus, it shouldnot be confused with simple normalization techniques to remove the scale bias among the subnetworkssince the regularization framework of MINProp explicitly seeks for a balanced cost of labeling all subnet-works, and MINProp is an intuitive and straight-forward label propagation algorithm for minimizing theobjective function of the regularization framework.

The time and space complexity of MINProp is in the same order as other standard graph-based learn-ing algorithms – quadratic in the number of vertices multiplied by the number of iterations. Scalabilityis still a limitation since heterogeneous networks typically contain more vertices in several subnetworksand thus, requirement on time and space complexity is often more stringent. One strategy is to usesparse implementation for the networks, given the sparseness of some of the subnetworks. In our previ-ous studies, we have applied such sparse implementations to model networks of size 100,000 by 100,000with a great speed [28, 35, 36]. Thus, we believe that the technique is also directly applicable to thisproblem. Another limitation is the di!culty in tuning the hyper-parameters for balancing the subgraphlaplacians. A commonly used method for parameter tuning is cross-validation. However, the size of mostgenomic datasets is too small to keep the cross-validation result stable. Thus, we also aim to apply otheroptimization techniques allowing quadratic constraints to learn the hyper-parameters [16,17,37,38].

With the advent of more high-throughput bio-technologies, there are many challenging problemsrequiring integration of large scale genomic datasets in biomedical informatics and bioinformatics appli-cations. MINProp is a general, robust and e!cient algorithm for data integration in these applications.

Supplementary Information

E!cient MINProp algorithm

The e!cient MINProp algorithm directly computes &i = (I " !iS(i))!1 at line 2 and use it within thefor-loop to compute the closed-form solution for label propagation on each homo-subnetwork (line 8) .Let t be the total number of iterations for the convergence of MINProp and ti is the number of time stepsto reach convergence on homo-subnetwork Gi. The runtime for label propagation on homo-subnetworkGi, O(ti|V (i)|2), is only one time. Thus, the total time complexity of e!cient MINProp is

O(t%

i

(|V (i)|%

j #=i

|V (j)| + |V (i)|2) + ti|V (i)|2) = O(t|V |2 +%

i

ti|V (i)|2). (5)

13

Algorithm 2 E!cient MINPropInputk: number of homo-subnetworks": convergence thresholdy1, y2, ..., yk: vectors of initial label values!1, !2, ...,!k: di"usion parametersS(1), S(2), ..., S(k): weight matrices of homo-subnetworksS(1,2), ..., S(k!1,k): weight matrices of hetero-subnetworks

Outputf1, f2, ..., fk: vectors of final label values1: fi = 0 for i = 1...k;2: &i = (I " !iS(i))!1 for i = 1...k;3: do4: fold

i = fi for i = 1...k;5: for i = 1...k6: t = 0, f0

i = 0;7: y$ = 1!k!i

1!!iyi + !i

1!!i

!j #=i S(i,j)fj ;

8: fi = (1" !i)&iy$;9: end for

10: while ('i s.t. # fi " foldi #) > ");

11: return f1, f2, ..., fk;

Proof of Lemma 1

Lemma 2. The optimal solution to the alternating optimization step on each fi in the objective function#(f) is f"i = (1" !i)(Ii " !iS(i))!1( 1!k!i

1!!iyi + !i

1!!i

!j S(i,j)fj).

Proof. By Proposition 1, #(f) is strictly convex. Thus, it can be minimized with alternating optimization(Bezdek and Hathaway, 2003). Specifically, for each fi, we fix the fj for all j ( {j|1 ) j ) k, j &= i} andthen di"erentiate #(f) with respect to fi to compute the closed-form solution f"i for minimizing #(f).We take partial derivative of #(f) with respect to fi and set it to zero,

###fi

= (Ii " S(i))f"i + µi(f"i " yi) + (k " 1)f"i "k%

j

S(i,j)fj = 0

(Ii " S(i))f"i + ((k " 1) + µi)f"i = µiyi +k%

j

S(i,j)fj

f"i "1

(k + µi)S(i)f"i =

1(k + µi)

(µiyi +k%

j

S(i,j)fj)

Let !i = 1/(k +µi) and after rearrangement, the closed-from solution f"i can be computed as follows,

(Ii " !iS(i))f"i = (1" k!i)(yi +

!i

(1" k!i)

k%

j

S(i,j)fj)

f"i = (1" !i)(Ii " !iS(i))!1(

1" k!i

1" !iyi +

!i

1" !i

%

j

S(i,j)fj).

14

This concludes the proof. Here, Ii"!iS(i) is positive definite since 0 < !i < 1 and the largest eigenvalueof S(i) is 1.

Parameter selection

In the experiments, we performed leave-one out cross validation to pick the best parameters for all themethods. The parameters used in the experiments for MINProp are !i = {0.1, 0.2, 0.3, 0.4, 0.47} fori = 1, 2, and µij is fixed as 1.0 for all experiments. The parameters used in the experiments for theRandom Walk are ! = {0.1, 0.3, 0.5, 0.7, 0.9}. Interestingly, the best parameters of MINProp are stablein all the experiments, which are !1 = 0.1 and !2 = 0.47. The best parameter for Random Walk is! = 0.1.

References

1. Freimer N, Sabatti C (2003) The human phenome project. Nat Genet 34: 15-21.

2. Ideker T, Sharan R (2008) Protein networks in disease. Genome Res 18: 644-652.

3. Oti M, Brunner H (2007) The modular nature of genetic diseases. Clin Genet 71: 1–11.

4. Goh KI, Cusick ME, Valle D, Childs B, Vidal M, et al. (2007) The human disease network. ProcNatl Acad Sci USA 104: 8685–8690.

5. Wagner G, Pavlicev M, Cheverud J (2007) The road to modularity. Nat Rev Genet 8: 921-931.

6. Oti M, Snel B, Huynen MA, Brunner HG (2006) Predicting disease genes using protein-proteininteractions. J Med Genet 43: 691-698.

7. Jiang X, Liu B, Jiang J, Zhao H, Fan M, et al. (2008) Modularity in the genetic disease-phenotypenetwork. FEBS Lett 582: 2549-2554.

8. Ala U, Piro RM, Grassi E, Damasco C, Silengo L, et al. (2008) Prediction of human disease genesby human-mouse conserved coexpression analysis. PLoS Comput Biol 4: e1000043.

9. Lee E, Chuang HY, Kim JW, Ideker T, Lee D (2008) Inferring pathway activity toward precisedisease classification. PLoS Comput Biol 4: e1000217.

10. Aerts S, Lambrechts D, Maity S, Van Loo P, Coessens B, et al. (2006) Gene prioritization throughgenomic data fusion. Nat Biotechnol 24: 537–544.

11. Oti M, Huynen M, Brunner H (2008) Phenome connections. Trends Genet 24: 103–106.

12. Wu X, Jiang R, Zhang MQ, Li S (2008) Network-based global inference of human disease genes.Mol Syst Biol 4.

13. Khler S, et al. (2008) Walking the interactome for prioritization of candidate disease genes. TheAmerican J of Hum Genet 82: 949 - 958.

14. Bezdek JC, Hathaway RJ (2003) Convergence of alternating optimization. Neural, Parallel SciComput 11: 351–368.

15. Tsuda K, Shin H, Scholkopf B (2005) Fast protein classification with multiple networks. Bioinfor-matics 21: 59-65.

15

16. Kato T, Kashima H, Sugiyama M (2008) Integration of multiple networks for robust label propa-gation. In: SDM. SIAM, pp. 716-726.

17. Lanckriet GRG, Cristianini N, Jordan MI, Noble WS (2004) Kernel-based integration of genomicdata using semidefinite programming. In: Kernel Methods in Computational Biology. pp. 209-231.

18. Ye J, Chen K, Wu T, Li J, Zhao Z, et al. (2008) Heterogeneous data fusion for alzheimer’s diseasestudy. In: KDD. ACM, pp. 1025–33. doi:http://doi.acm.org/10.1145/1401890.1402012.

19. Huang J, Zhu T, Rereiner R, Zhou D, Schuurmans D (2006) Information marginalization on sub-graphs. In: PKDD. pp. 199-210.

20. Zhou D, Orshanskiy SA, Zha H, Giles CL (2007) Co-ranking authors and documents in a het-erogeneous network. In: ICDM. Washington, DC, USA: IEEE Computer Society, pp. 739–744.doi:http://dx.doi.org/10.1109/ICDM.2007.57.

21. Zhou D, Zhu S, Yu K, Song X, Tseng BL, et al. (2008) Learning multiple graphs for documentrecommendations. In: WWW. pp. 141–150. doi:http://doi.acm.org/10.1145/1367497.1367517.

22. Szummer M, Jaakkola T (2001) Partially labeled classification with markov random walks. In:NIPS. pp. 945–952.

23. Zhuxj XZ, Ghahramani Z, La"erty J (2003) Semi-supervised learning using gaussian fields andharmonic functions. In: ICML. pp. 912–919.

24. Belkin M, Niyogi P (2003) Using manifold stucture for partially labeled classification. In: S BeckerST, Obermayer K, editors, NIPS. pp. 929–936.

25. Zhou D, Bousquet O, Lal TN, Weston J, Schlkopf B, et al. (2004) Learning with local and globalconsistency. In: NIPS. volume 16, pp. 321-328.

26. Bengio Y, Delalleau O, Le Roux N (2006) Label propagation and quadratic criterion. In: ChapelleEO, et al., editors, Semi-Supervised Learning, MIT Press.

27. Weston J, Elissee" A, Zhou D, Leslie CS, Noble WS (2004) Protein ranking: from local to globalstructure in the protein similarity network. Proc Natl Acad Sci USA 101: 6559–63.

28. Kuang R, Weston J, Noble WS, Leslie C (2005) Motif-based protein ranking by network propaga-tion. Bioinformatics 21: 3711-3718.

29. Hwang T, Sicotte H, Tian Z, Wu B, Kocher JPao (2008) Robust and e!cient identification ofbiomarkers by classifying features on graphs. Bioinformatics 24: 2023–2029.

30. Xu Y, Zhou X, Zhang W (2008) Microrna prediction with a novel ranking algorithm based onrandom walks. Bioinformatics 24: i50-8.

31. Hamosh A, Scott A, Amberger J, Valle D, McKusick V (2000) Online mendelian inheritance inman (OMIM). Hum Mutat 15: 57-61.

32. van Driel MA, Bruggeman J, Vriend G, Brunner HG, Leunissen JA (2006) A text-mining analysisof the human phenome. Eur J Hum Genet 14: 535–542.

33. WILLIAM M MANGER pPtAyuhpnda journal = Annals of the New York Academy of Sciences .

34. Bertram L, Tanzi RE (2008) Thirty years of alzheimer’s disease genetics: the implications ofsystematic meta-analyses. Nat Rev Neurosci 9: 768–778.

16

35. Noble W, Kuang R, Leslie C, Weston J (2005) Identifying remote protein homologs by networkpropagation. FEBS J 272: 5119-5128.

36. Weston J, Kuang R, Leslie C, Noble W (2006) Protein ranking by semi-supervised network prop-agation. BMC Bioinformatics 7 Suppl 1: S10.

37. Argyriou A, Herbster M, Pontil M (2005) Combining graph laplacians for semi-supervised learning.In: NIPS. MIT Press, pp. 67–74.

38. Wang F, Zhang C (2006) Label propagation through linear neighborhoods. In: ICML06, 23rdInternational Conference on Machine Learning.

17

Figure Legends

g1

g2

g3

g4

g5

g6 g7

g8

g9

g10 g11

g12

Disease Phenotype

Disease Gene

Non-Disease Gene

Disease Phenotype Similarity Network

Gene-Gene Interaction Network

Disease Phenotype-Gene Network

Homo-subnetwork i

Homo-subnetwork j Homo-subnetwork k

Hetero-subnetwork ij

Hetero-subnetwork jk

Hetero-subnetwork ik

A B

A heterogeneous disease-gene network A heterogeneous network with three types of vertices

Figure 1. Illustration of heterogeneous networks. (A) A disease-gene network contains two typesof objects: disease phenotypes and genes. There are three subnetworks defined on the diseasephenotype vertices and gene vertices: the disease phenotype-phenotype similarity network, thegene-gene interaction network and the disease phenotype-gene network. (B) An example of aheterogeneous network with three types of vertices. There are three homo-subnetworks and threehetero-subnetworks in this heterogeneous network.

18

!"#

!$#

!%#

&$#

&%#

'$# '%#

!"(&"#

!$(&$#

!%(&%#

!"('"# &"('"#

!$('$#

&$('$#

!%('%#&%('%#

)*+*(,-./012*34#!# )*+*(,-./012*34#&#

)*+*(,-./012*34#'#

&"#

'"#

A B C D

0 30 150 300 600 900 12000.45

0.5

0.55

0.6

0.65

0.7

0.75

0.8

0.85

0.9

Size of Network C

Av

era

ge

AU

C

MINProp

Random Walk

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7

0.84

0.85

0.86

0.87

0.88

Percentage of Noisy Edges in Network C

Av

era

ge

AU

C

MINProp

Random Walk

0 1 3 5 7 90.5

0.55

0.6

0.65

0.7

0.75

0.8

0.85

0.9

Edge!Weight Scale in Network C

Av

era

ge

AU

C

MINProp

Random Walk

Weighting scale of edges Unbalanced subnetworks Noisy subnetwork The heterogeneous-network topology

Figure 2. Simulation results. (A) The structure of the heterogenous networks used in thesimulations. There are three homo-subnetworks and three hetero-subnetworks in the heterogenousnetwork. The three global clusters (A1

"B1

"C1, A2

"B2

"C2 and A3

"B3

"C3) contain cluster

members across the three homo-subnetworks. (B)-(D) Ranking performance on heterogenous networkswith di"erent biases.

19

A B

0 1000 2000 3000 4000 5000 6000 70000

10

20

30

40

50

60

70

80

Number of Nodes

Tim

e (

se

co

nd

s)

!1 = 0.1, !

2 = 0.1

!1 = 0.1, !

2 = 0.3

!1 = 0.1, !

2 = 0.45

!1 = 0.3, !

2 = 0.1

!1 = 0.3, !

2 = 0.3

!1 = 0.3, !

2 = 0.45

!1 = 0.45, !

2 = 0.1

!1 = 0.45, !

2 = 0.3

!1 = 0.45, !

2 = 0.45

0 1000 2000 3000 4000 5000 6000 70000

3

6

9

12

15

18

20

Number of Nodes

Nu

mb

er o

f It

era

tio

ns

Figure 3. Time complexity of MINProp. The plots show the scalability and the convergence ofMINProp under various sizes of heterogenous networks. The x-axis is the size of the heterogenousnetwork. (A) The y-axis is the running time for reaching the convergence. (B) The y-axis is the numberof iterations for reaching the convergence. Nine di"erent pairs of (!1, !2) are tested.

20

Tables

Table 1. Disease gene prioritization performance in leave-one-out cross-validation. Thistable reports the average AUC scores and the pairwise win/draw/loss comparisons between MINPropand the baselines, Random Walk (RW) and CIPHER DN or SP, across the query phenotypes in the twosettings.

MethodsAssociations with known disease genes Associations with new disease genes

Avg. AUC (win/draw/loss) Avg. AUC (win/draw/loss)MINProp vs. Random Walk 0.805 vs. 0.797 (738/75/313) 0.728 vs. 0.648 (1045/2/79)MINProp vs. CIPHER-DN 0.863 vs. 0.738 (565/5/288) 0.821 vs. 0.738 (515/11/332)MINProp vs. CIPHER-SP 0.805 vs. 0.734 (678/8/440) 0.728 vs. 0.729 (538/54/534)

Table 2. Disease gene prioritization performance in the case study. This table reports theaverage AUC scores and the pairwise win/draw/loss comparisons between MINProp and the baselines,Random Walk and CIPHER SP, across the query phenotypes in the two settings.

MethodsAssociations with known disease genes Associations with new disease genes

Avg. AUC (win/draw/loss) Avg. AUC (win/draw/loss)MINProp vs. Random Walk 0.985 vs. 0.926 (71/11/19) 0.716 vs. 0.696 (198/3/92)MINProp vs. CIPHER SP 0.985 vs. 0.703 (101/0/0) 0.716 vs. 0.667 (200/0/93)

21

Table 3. Ranking disease genes of the phenotypes of rare cancers, syndromes anddisorders. The status column indicates whether the disease gene is among previously known diseasegenes (known) or non-disesease genes (new). The ranking and the percentages are calculated from twodi"erent sets of control genes based on the gene’s status, i.e. for a previously known disease gene, thecontrol genes are all the genes, and for a non-disease gene, the control genes are the non-disease genes.

MIM# Phenotype Name HGNC SymbolRanking (%) Ranking (%) Ranking (%)

StatusMINProp CIPHER SP Random Walk

171300 PHEOCHROMOCYTOMA

VHL 1 (0.000) 1105 (0.138) 1105 (0.138) newGDNF 16 (0.002) 1400 (0.157) 1400 (0.157) knownKIF1B 228 (0.028) 512 (0.064) 512 (0.064) newRET 2894 (0.362) 1232 (0.154) 3104 (0.388) new

607174 MENINGIOMA, FAMILIALNF2 1 (0.000) 1279 (0.143) 1279 (0.143) known

PTEN 5 (0.001) 1307 (0.147) 1307 (0.147) knownMN1 5986 (0.748) 1685 (0.211) 5797 (0.724) new

600880 BUDD-CHIARI SYNDROMEF5 1 (0.000) 398 (0.050) 398 (0.050) new

JAK2 51 (0.006) 497 (0.062) 161 (0.020) new

166710 OSTEOPOROSIS

LRP5 2 (0.000) 1541 (0.173) 1541 (0.173) knownCALCR 4 (0.000) 7661 (0.957) 7661 (0.957) newCOL1A1 5 (0.001) 8086 (0.907) 8086 (0.907) known

VDR 42 (0.005) 1402 (0.157) 1402 (0.157) knownCOL1A2 3255 (0.407) 7182 (0.897) 3397 (0.424) new

601626 LEUKEMIA, ACUTE MYELOID

MLF1 3 (0.000) 4323 (0.540) 4323 (0.540) newJAK2 15 (0.002) 354 (0.044) 70 (0.009) newETV6 23 (0.003) 769 (0.096) 769 (0.096) newGMPS 245 (0.031) 4512 (0.564) 4512 (0.564) newCEBPA 526 (0.066) 2376 (0.297) 708 (0.088) newSH3GL1 629 (0.079) 1047 (0.131) 227 (0.028) newFLT3 1860 (0.232) 2107 (0.263) 2463 (0.308) newNPM1 1880 (0.235) 1177 (0.147) 2396 (0.299) newRUNX1 1887 (0.212) 4156 (0.466) 1834 (0.206) known

KIT 3377 (0.379) 4024 (0.451) 3227 (0.362) knownCHIC2 3881 (0.485) 3807 (0.476) 4376 (0.547) newLPP 4141 (0.517) 4330 (0.541) 3724 (0.465) newNSD1 4388 (0.492) 502 (0.056) 4658 (0.522) known

NUP214 4962 (0.620) 5311 (0.664) 4933 (0.616) newARHGEF12 6202 (0.775) 5714 (0.714) 5841 (0.730) new

202300 ADRENOCORTICAL CARCINOMA TP53 1 (0.000) 1249 (0.140) 61 (0.007) known300299 NEUTROPENIA WAS 1 (0.000) 1656 (0.186) 30 (0.003) known

601367 STROKE, ISCHEMIC

PRKCH 14 (0.002) 448 (0.056) 448 (0.056) newALOX5AP 154 (0.017) 7892 (0.885) 7892 (0.885) known

NOS3 1243 (0.155) 4096 (0.512) 866 (0.108) newF5 2111 (0.237) 4073 (0.457) 2222 (0.249) known

188470 THYROID CARCINOMA

PTEN 20 (0.002) 710 (0.089) 710 (0.089) newHRAS 60 (0.007) 1182 (0.133) 1182 (0.133) knownNRAS 719 (0.081) 2047 (0.230) 2047 (0.230) known

MINPP1 2100 (0.262) 4508 (0.563) 2720 (0.340) new

604131 THALASSEMIASHBB 24 (0.003) 2025 (0.227) 2025 (0.227) knownHBA2 40 (0.005) 445 (0.056) 91 (0.011) new

177900 PSORIASIS SUSCEPTIBILITY 1HLA-C 50 (0.006) 2368 (0.296) 2368 (0.296) newIL23R 397 (0.045) 2297 (0.258) 601 (0.067) known

273300 TESTICULAR TUMORSSTK11 78 (0.009) 819 (0.092) 633 (0.071) knownKIT 88 (0.011) 3437 (0.429) 652 (0.081) new

Table 4. Ranking performance on phenotypes with di"erent gene clustering coe!cients.The table reports relative performance of the MINProp with Random Walk (RW), CIPHER DN & SPand Hybrid (Combined ranking of MINProp and CIPHER SP).

CCMINProp vs. RW MINProp vs. CIPHER-DN MINProp vs. CIPHER-SP Hybrid vs. MINProp

Avg. AUC Avg. AUC Avg. AUC Avg. AUC[0.1, 1] 0.875 vs. 0.776 0.889 vs. 0.855 0.875 vs. 0.813 0.886 vs. 0.875

[0.01, 0.1) 0.902 vs. 0.799 0.906 vs. 0.799 0.902 vs. 0.801 0.911 vs. 0.902[0, 0.1) 0.653 vs. 0.586 0.770 vs. 0.688 0.654 vs. 0.693 0.692 vs. 0.654Total 0.728 vs. 0.648 0.821 vs. 0.738 0.728 vs. 0.729 0.756 vs. 0.727

2215

!"!

!"!

!"!

!"!

!!!"#$%&$%'()%*+,-.%!/#0#1&2#,-!*%,3+24!

5%*%'5%*%!6*,%2&78+*!*%,3+24!

()%*+,-.%'5%*%!*%,3+24!

!"!

!"#$%&&'(#)*+%,#&

92%&$,!7&*7%2!

:;*<!7&*7%2!

=>%$#,-!

"#&>%,%$!

!"!

!"! !"! !"!

!"!

#!

"!

"!

"!

-$"#&./0#10#&2#)#&

!"!

!"!

"#$%&$%!5%*%!

?+*'"#$%&$%!5%*%!

"#$%&$%!()%*+,-.%!

3#4*5#&+(#&6/$#7+&8/)9&*)8%&

!"#"$!"#"%&#'"()*+,#%#"'-,(.%

/0"#,'12"$!"#"%#"'-,(.%

!"#$%&&'(#)*+%,#&

3(")4'%*)#*"(%

56#7%*)#*"(%

89"4:'1%

;:)9"'"4%

!"#$

!"%$

!"&$

!"'$

-$"#&./0#10#&2#)#&

!"($!")$

!")$!"'$

!")$

!"*'$

!"*$

!"&#$

!"&+$

!"&)$

!"&'$

!"*&$

%%%;:4")4"$/0"#,'12"%<:=:>)(:'1%#"'-,(.%

;:4")4"%!"#"%

?,#$;:4")4"%!"#"%

;:4")4"%/0"#,'12"%!"#$%"&'()#)*'

(A) Discovering new phenotype-gene associations (B) Discovering new disease genes

Figure 3. Illustrations of gene prioritization experiments. (A) In the experiment of discoveringnew phenotype-gene associations, only the link between the disease gene and the query was removed.The other link was kept. The final ranking is measured on all the genes in the gene-gene interactionnetwork. (B) In the experiment of discovering new disease genes, we removed the link between thedisease gene of the query and the link between this gene and other phenotypes (the dashed linesconnecting the true disease gene of the query with all the phenotypes). The final ranking is onlymeasured on the disease gene plus the non-disease genes. The top figures show the initialization to runMINProp and Random Walk with restart. We first initialize the label score of the query vertex with 1and the other vertices with zero. The bottom figures shows the results after label propagationconverges. After label propagation converges, each vertex has a final label score. We rank the truedisease gene of the query phenotype and all the non-disease genes according to the label scores assignedby the algorithm. Note that the final label scores in the figure are artificially assigned in this example.This example shows gene ranking to discover disease genes of breast cancer: we assign label score one tobreast cancer and zero to the other disease phenotypes and the genes. We remove the edges betweenthe disease gene and breast cancer (or all the phenotypes) from the network. After label propagationconverges, we rank the true disease gene of breast cancer with non-disease genes according to the labelscores.

15

!"!

!"!

!"!

!"!

!!!"#$%&$%'()%*+,-.%!/#0#1&2#,-!*%,3+24!

5%*%'5%*%!6*,%2&78+*!*%,3+24!

()%*+,-.%'5%*%!*%,3+24!

!"!

!"#$%&&'(#)*+%,#&

92%&$,!7&*7%2!

:;*<!7&*7%2!

=>%$#,-!

"#&>%,%$!

!"!

!"! !"! !"!

!"!

#!

"!

"!

"!

-$"#&./0#10#&2#)#&

!"!

!"!

"#$%&$%!5%*%!

?+*'"#$%&$%!5%*%!

"#$%&$%!()%*+,-.%!

3#4*5#&166&6/)70&*8&9/0#10#&:#)#0&

!"#"$!"#"%&#'"()*+,#%#"'-,(.%

/0"#,'12"$!"#"%#"'-,(.%

!"#$%&&'(#)*+%,#&

3(")4'%*)#*"(%

56#7%*)#*"(%

89"4:'1%

;:)9"'"4%

!"#$

!"%$

!"&$

!"'$

-$"#&./0#10#&2#)#&

!"($!")$

!")$!"'$

!")$

!"*'$

!"*$

!"&'$

!"&+$

!"&)$

!"&'$

!"*&$

%%%;:4")4"$/0"#,'12"%<:=:>)(:'1%#"'-,(.%

;:4")4"%!"#"%

?,#$;:4")4"%!"#"%

;:4")4"%/0"#,'12"%!"#$%"&'()#)*'

(A) Discovering new phenotype-gene associations (B) Discovering new disease genes

Figure 3. Illustrations of gene prioritization experiments. (A) In the experiment of discoveringnew phenotype-gene associations, only the link between the disease gene and the query was removed.The other link was kept. The final ranking is measured on all the genes in the gene-gene interactionnetwork. (B) In the experiment of discovering new disease genes, we removed the link between thedisease gene of the query and the link between this gene and other phenotypes (the dashed linesconnecting the true disease gene of the query with all the phenotypes). The final ranking is onlymeasured on the disease gene plus the non-disease genes. The top figures show the initialization to runMINProp and Random Walk with restart. We first initialize the label score of the query vertex with 1and the other vertices with zero. The bottom figures shows the results after label propagationconverges. After label propagation converges, each vertex has a final label score. We rank the truedisease gene of the query phenotype and all the non-disease genes according to the label scores assignedby the algorithm. Note that the final label scores in the figure are artificially assigned in this example.This example shows gene ranking to discover disease genes of breast cancer: we assign label score one tobreast cancer and zero to the other disease phenotypes and the genes. We remove the edges betweenthe disease gene and breast cancer (or all the phenotypes) from the network. After label propagationconverges, we rank the true disease gene of breast cancer with non-disease genes according to the labelscores.

A B

Figure 4. Illustrations of gene prioritization experiments. (A) In the experiment of uncoveringassociations with known disease genes, only the link between the disease gene and the query wasremoved. The other link was kept. The final ranking is measured on all the genes in the gene-geneinteraction network. (B) In the experiment of discovering new disease genes, we removed the linkbetween the disease gene of the query and the link between this gene and other phenotypes (the dashedlines connecting the true disease gene of the query with all the phenotypes). The final ranking is onlymeasured on the disease gene plus the non-disease genes. The top figures show the initialization to runMINProp and Random Walk with restart. We first initialize the label score of the query vertex with 1and the other vertices with zero. The bottom figures shows the results after label propagationconverges. After label propagation converges, each vertex has a final label score. We rank the truedisease gene of the query phenotype and all the non-disease genes according to the label scores assignedby the algorithm. Note that the final label scores in the figure are artificially assigned in this example.This specific example shows the gene ranking task in discovering a disease gene of breast cancer: weassign label score one to breast cancer and zero to the other disease phenotypes and the genes. Weremove the edges between the disease gene and breast cancer (or all the phenotypes) from the network.After label propagation converges, we rank the true disease gene of breast cancer with non-disease genesaccording to the label scores.

23

A D

B E

C F

Associations with new disease genes Associations with known disease genes

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

AUCs by MINProp

AU

Cs b

y R

an

do

m W

alk

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

AUCs by MINProp

AU

Cs b

y R

an

do

m W

alk

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

AUCs by MINProp

AU

Cs

by

CIP

HE

R D

N

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

AUCs by MINProp

AU

Cs

by

CIP

HE

R D

N

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

AUCs by MINProp

AU

Cs b

y C

IPH

ER

SP

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

AUCs by MINProp

AU

Cs

by

CIP

HE

R S

P

A D

B E

C F


0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

AUCs by MINProp

AU

Cs b

y R

an

do

m W

alk

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

AUCs by MINProp

AU

Cs b

y R

an

do

m W

alk

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

AUCs by MINProp

AU

Cs b

y C

IPH

ER

DN

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

AUCs by MINProp

AU

Cs b

y C

IPH

ER

DN

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

AUCs by MINProp

AU

Cs

by

CIP

HE

R S

P

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

AUCs by MINProp

AU

Cs

by

CIP

HE

R S

P

Figure 5. Query-wise scatter plot of the AUC scores in leave-one-out cross-validation. Inthe scatter plots, the x-axis and the y-axis of a dot are the AUCs on a query phenotype by MINPropand the compared method, respectively. (A)-(C) The three figures on the left are from the experimentsof uncovering associations with known disease genes. (D)-(F) The right figures are from theexperiments on discovering new disease susceptibility genes from unknown disease genes.

24

A C

B D

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

AUCs by MINProp

AU

Cs

by

Ra

nd

om

Wa

lk

Known Disese Genes

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

AUCs by MINProp

AU

Cs

by

CIP

HE

R S

P

Known Disease Genes

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

AUCs by MINProp

AU

Cs

by

Ra

nd

om

Wa

lk

Unknown Disease Genes

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

AUCs by MINProp

AU

Cs

by

CIP

HE

R S

P



A C

B D

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

AUCs by MINProp

AU

Cs

by

Ra

nd

om

Wa

lk

Known Disese Genes

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

AUCs by MINProp

AU

Cs

by

CIP

HE

R S

P

Known Disease Genes

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

AUCs by MINProp

AU

Cs

by

Ra

nd

om

Wa

lk


0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

AUCs by MINProp

AU

Cs

by

CIP

HE

R S

P



Figure 6. Query-wise scatter plot of the AUC scores in the case study. In the scatter plots,the x-axis and the y-axis of a dot are the AUCs on a query phenotype by MINProp and the comparedmethod, respectively. (A)-(B) The two figures on the left are from the experiments of uncoveringassociations with known disease genes. (C)-(D) The right figures are from the experiments ondiscovering new disease susceptibility genes from unknown disease genes.

25

Phenotype

Cell to Cell Signaling

DNA Replication

Cell Cycle

Cell Signaling Cell Signaling

Cell Death

Cell Signaling Cellular Assembly

Cell Signaling

Cellular Movement

Gene Expression Cellular Growth

AGe

ne

Figure 7. Modular view of human phenome-genome association. This map shows modularstructures of disease phenotypes caused by functionally and topologically related genes and phenotypes.The color describes the degree of association between phenotypes and genes.

technical report - cs.umn.edu · sources including functional annotations, gene expressions and...

Documents