
Inference of Sparse Networks with Unobserved Variables. Application to Gene Regulatory Networks

Nikolai Slavov
Princeton University, NJ 08544
[email protected]

Abstract

Networks are becoming a unifying framework for modeling complex systems, and network inference problems are frequently encountered in many fields. Here, I develop and apply a generative approach to network inference (RCweb) for the case when the network is sparse and the latent (not observed) variables affect the observed ones. From all possible factor analysis (FA) decompositions explaining the variance in the data, RCweb selects the FA decomposition that is consistent with a sparse underlying network. The sparsity constraint is imposed by a novel method that significantly outperforms (in terms of accuracy, robustness to noise, complexity scaling and computational efficiency) Bayesian methods and MLE methods using ℓ1-norm relaxation such as K-SVD and ℓ1-based sparse principal component analysis (PCA). Results from simulated models demonstrate that RCweb recovers exactly the model structures for sparsity as low (as non-sparse) as 50% and with a ratio of unobserved to observed variables as high as 2. RCweb is robust to noise, with gradual decrease in the parameter ranges as the noise level increases.

1 INTRODUCTION

Factor analysis (FA) decompositions are useful for explaining the variance of observed variables in terms of fewer unobserved variables that may capture systematic effects and allow for low dimensional representation of the data. Yet, the interpretation of latent variables inferred by FA is fraught with problems. In fact, interpretation is not always expected and intended since FA may not have an underlying generative model. A prime difficulty with interpretation arises from the fact that any rotation of the factors and their loadings by an orthogonal matrix results in a different FA decomposition that explains the variance in the observed variables just as well. Therefore, in the absence of additional information on the latent variables and their loadings, FA cannot identify a unique decomposition, much less generative relationships between latent factors and observed variables.

Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9. Copyright 2010 by the authors.

A frequent choice for a constraint, implemented by principal component analysis (PCA) and resulting in a unique solution, is that the factors are the singular vectors (and thus orthogonal to each other) of the data matrix, ordered in descending order of their corresponding singular values. Yet, this choice is often motivated by computational convenience rather than by knowledge about the system that generated the data. Another type of computationally convenient constraint applied to facilitate the interpretation of FA results is sparsity, in the form of sparse Bayesian FA (West 2002, Dueck 2005, Carvalho 2008), sparse PCA (d'Aspremont 2007, Sigg 2008) and FA for gene regulatory networks (Srebro 2001, Pe'er 2002). However, papers that introduce and use sparse PCA do not consider a generative model but rather use sparsity as a convenient tool to produce interpretable factors that are linear combinations of just a few original variables. In sparse PCA, sparsity is a way to gain interpretability at the cost of a slightly lower fraction of explained variance.

The algorithm described in this paper (RCweb) also uses a sparse prior, but RCweb explicitly considers the problem from a generative perspective. RCweb asserts that there is indeed a set of hidden variables that connect to and regulate the observed variables via a sparse network. Based on that model, I derive a network structure learning approach within an explicit theoretical framework. This allows me to propose an approach for sparse FA which is conceptually and computationally different from all existing approaches such as K-SVD (Aharon 2005), sparse PCA and other LARS-based (Bradley 2004) methods (Banerjee 2007, Friedman 2008). RCweb is appropriate for analyzing data arising from any system in which the state of each observed variable is affected by a strict subset of the unobserved variables. To assign the inferred latent variables to physical factors, RCweb needs either data from perturbation experiments or prior knowledge about the factors. This framework generalizes to non-linear interactions, which is discussed elsewhere. Furthermore, I analyze the scaling of the computational complexity of RCweb with the number of observed and unobserved variables, as well as the parameter space where RCweb can accurately infer network topologies, and demonstrate its robustness to noise in the data.

2 DERIVATION

Consider a sparse bipartite graph G = (E, N, R) consisting of two sets of vertices N and R and the associated set of directed edges E connecting R vertices to N vertices. Define a graphical model in which each vertex s corresponds to a random variable: N observed random variables indexed by N (x_N = {x_s | s ∈ N}) whose states are functions of P unobserved variables indexed by R, x_R = {x_s | s ∈ R}. Since the states of x_N depend on (are regulated by) x_R, I will also refer to x_R as regulators. The functional dependencies are denoted by a set of directed edges E, so that each unobserved variable x_i, i ∈ R, affects (and its vertex is thus connected to) a subset of observed variables x_αi = {x_s | s ∈ αi ⊂ N}. Given a dataset G ∈ R^{M×N} of M configurations of the observed variables x_N, RCweb aims to infer the edges E and the corresponding configurations of the unobserved variables, x_R.

If the state of each observed variable is a linear superposition of a subset of unobserved variables, the data G can be modeled with a very simple generative model (1): the data is a product between R ∈ R^{M×P} (a matrix whose columns correspond to the unobserved variables and whose rows correspond to the M measured configurations) and C ∈ R^{P×N}, the weighted adjacency matrix of G. The unexplained variance in the data G is captured by the residual Υ.

G = RC + Υ (1)
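To make (1) concrete, here is a minimal simulation sketch of this generative model (my illustration, not code from the paper; the dimensions, sparsity level and noise scale are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, P = 280, 500, 30       # configurations, observed, and hidden variables
sparsity = 0.9               # fraction of zero entries in C

# Sparse weighted adjacency matrix C of the bipartite graph G.
C = rng.normal(size=(P, N)) * (rng.random((P, N)) > sparsity)
# Hidden configurations R (standard uniform, as in section 4).
R = rng.random((M, P))
# Residual term Upsilon: the unexplained variance.
Upsilon = 0.1 * rng.normal(size=(M, N))

G = R @ C + Upsilon          # equation (1)
```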

This decomposition of G into a product of two matrices can be considered a type of factor analysis, with R being the factors and C the loadings. Even when P ≪ M, the decomposition of G does not have a unique solution, since RC ≡ RIC ≡ RQᵀQC ≡ R*C* for any orthonormal matrix Q. Thus the identification of a unique decomposition corresponding to the structure of G requires additional criteria constraining the decomposition. The assumption that G is sparse requires that C be sparse as well, meaning that the state of the ith observed vertex x_i, i ∈ N, is affected by a strict subset of the unobserved variables x_ψi = {x_s | s ∈ ψi ⊂ R}: the ones whose weights in the ith column of C are non-zero, C_ψi,i ≠ 0 and C_ψ̄i,i = 0, where ψ̄i is the complement of ψi. Thus, to recover the structure of G, RCweb seeks a decomposition of G in which C is sparse. The sparsity can be introduced as a regularization with a Lagrange multiplier λ:

(C, R) = arg min_{R,C} ‖G − RC‖_F² + λ‖C‖_0   (2)

In the equations above and throughout the paper, ‖C‖_F² = Σ_{i,j} C_ij² denotes the entry-wise (Frobenius) norm, and the zero norm of a vector or matrix, ‖C‖_0, equals the number of non-zero elements in the array.
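In code, the objective of (2) is a one-liner; the sketch below (an illustrative helper of my own, with an arbitrary λ) spells out the two norms:

```python
import numpy as np

def objective(G, R, C, lam=0.1):
    """Equation (2): squared Frobenius error plus lam times ||C||_0."""
    frob_sq = np.sum((G - R @ C) ** 2)   # ||G - RC||_F^2
    zero_norm = np.count_nonzero(C)      # ||C||_0: number of non-zero entries
    return frob_sq + lam * zero_norm
```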

To infer the network topology, RCweb aims to solve the optimization problem defined by (2). Since (2) is an NP-hard combinatorial problem, the solution can be simplified significantly by relaxing the ℓ0 norm to the ℓ1 norm (Bradley 2004). Then the approximated problem can be tackled with interior point methods (Banerjee 2007). As an alternative to ℓ1 approximation, I propose a novel method based on introducing a degree of freedom in the singular value decomposition (SVD) of G by inserting an invertible¹ matrix B.

G = USVᵀ ≡ (US(Bᵀ)⁻¹)(BᵀVᵀ) = RC, where R = US(Bᵀ)⁻¹ and C = BᵀVᵀ   (3)

The prior (constraint) that C is sparse determines the B that minimizes (2), and thus a unique decomposition. The goal of introducing B is to reduce the combinatorial problem to one that can be solved with convex minimization. When the factors underlying the observed variance are fewer than the observations in G, there is no need to take the full SVD; if P factors are expected, only the first P largest singular vectors and values from the SVD of G are taken in that decomposition, so that USVᵀ is the matrix with rank P that best approximates G in the sense of minimizing ‖G − USVᵀ‖_F². Since conceivably sparse decompositions may use columns outside of the best ℓ2 approximation, RCweb considers taking the first P* singular pairs for P* > P. Such an expanded basis is more likely to support the optimal sparse solution and is especially relevant for the case when P is not known. Such a choice can be easily accommodated in light of the ability of RCweb to exclude unnecessary explanatory variables; see section 4.3.

¹B is always invertible by construction; see section 4.3 and equation (4).
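The identity in (3) is easy to verify numerically. A small sketch (illustrative only; the random B stands in for the B that RCweb actually optimizes):

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.normal(size=(50, 80))
P = 10

U, s, Vt = np.linalg.svd(G, full_matrices=False)
U, S, Vt = U[:, :P], np.diag(s[:P]), Vt[:P, :]   # rank-P truncation

B = rng.normal(size=(P, P))          # a generic random B is invertible
R = U @ S @ np.linalg.inv(B.T)       # R = U S (B^T)^{-1}
C = B.T @ Vt                         # C = B^T V^T

# R C reproduces the rank-P approximation of G for any invertible B.
assert np.allclose(R @ C, U @ S @ Vt)
```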


Next RCweb computes B based on the requirement that C is sparse, for the case N > P. To compute B, one may set up the optimization problem (4). Once B is inferred, R and C can be computed easily: R = US(Bᵀ)⁻¹ and C = (VB)ᵀ.

B = arg min_B ‖VB‖_0, such that det(B) > 1   (4)

The constraint on B in (4) is introduced to avoid trivial and degenerate solutions, such as a rank-deficient B. Thus the introduction of B reduces (2) to a problem (4) that is still combinatorial and might also be approximated with a more tractable problem by relaxing the ℓ0 norm to the ℓ1 norm and applying heuristics (Cetin 2002, Candes 2007) to enhance the solution. I propose a new approach, RCweb, outlined in the next section.

3 RCweb

Assume that a sparse c_iᵀ corresponding to an optimal b_i (the ith column of B), for which Vb_i = c_iᵀ, is known. Define the set of indices corresponding to non-zero elements of c_iᵀ as ω− and the set of indices corresponding to zero elements of c_iᵀ as ω0. Furthermore, define the matrix V_ω0 to be the matrix containing only the rows of V whose indices are in ω0. If ω0, and thus V_ω0, are known, one can easily compute b_i as the right singular vector of V_ω0 corresponding to the zero singular value. Since c_iᵀ and ω0 are not known, RCweb approximates b_i (the smallest² right singular vector of V_ω0) with v_s, the smallest right singular vector of V. This approximation relies on assuming that a low rank perturbation of a matrix results in a small change in its smallest singular vectors (Wilkinson 1988, Benaych-Georges 2009). Thus, given that RCweb is looking for the sparsest solution and the set ω− is small relative to N, the angle between the singular vectors of V_ω0 and V is small as well. Therefore, v_s can serve as a reasonable first approximation of b_i. Then RCweb systematically and iteratively uses and updates v_s by removing rows of V until v_s converges to b_i or, equivalently, V_ω0 becomes singular for the largest set of ω0 indices. When V_ω0 becomes singular, all elements of c_i whose indices are in ω0 become zero.

RCweb also has an intuitive geometrical interpretation. Consider the matrix V mapping the unit sphere in R^P (the sphere with unit radius in R^P) to an ellipsoid in R^N. The axes of the ellipsoid are the left singular vectors of V. In this picture, starting with ω− = {∅} and ω0 = {1, . . . , N}, solving (4) requires moving the fewest number of indices from ω0 to ω− so that V_ω0 maps a vector from R^P to the origin. How to choose the indices to be moved? At each step RCweb chooses i_|max|, the index of the largest element (by absolute value) of the smallest axis of the ellipsoid, which is the left singular vector of V with the smallest singular value. RCweb moves i_|max| from ω0 to ω−, effectively selecting the dimension whose projection is easiest to eliminate and removing its largest component, which minimizes as much as possible the projection in that dimension. RCweb keeps moving indices from ω0 to ω− using the same procedure until the smallest right singular vector of V_ω0 converges to b_i and the smallest singular value of V_ω0 approaches zero. RCweb is guaranteed to stop after at most (N−P+1) steps since, after removal of (N−P+1) indices from ω0, V_ω0 will be at most rank P−1. If RCweb finds a sparse solution it will converge in fewer steps.

²By smallest singular vector I mean the singular vector corresponding to the smallest singular value.

1. Task: b_i = arg min_{b_iᵀ b_i ≥ 1} ‖Vb_i‖_0

2. Initialization:

• ω− = {∅} and ω0 = {1, 2, . . . , N}
• Set K⁻¹_ω0 = (V_ω0ᵀ V_ω0)⁻¹ = I ∈ R^{P×P}
• Set J = 1
• i_|max| = arg max_i (Σ_j |V_ij|); ω− = {i_|max|}, ω0 = {i | i ∈ ω0, i ≠ i_|max|}
• Update K⁻¹_ω0 = RankUpdate(K⁻¹_ω0, V_i|max|)

3. Cycle: J = J + 1; repeat until convergence:

• Find the eigenvector v of K⁻¹_ω0 with the largest eigenvalue λ_max
• If λ_max⁻¹ ≈ 0 or v_J → v_{J−1}, set b_i ≡ v; STOP
• Compute the left singular vector of V_ω0: u = s⁻¹ V_ω0 v
• i_|max| = arg max (|u_1|, . . . , |u_i|, . . . , |u_N|); ω− = {ω−, i_|max|}, ω0 = {i | i ∈ ω0, i ≠ i_|max|}
• Update K⁻¹_ω0 = RankUpdate(K⁻¹_ω0, V_i|max|)
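The following numpy sketch is my reconstruction of this single-column routine from the pseudocode above (not the author's released code; the function names and the Sherman-Morrison helper are mine). It assumes V has orthonormal columns, as for right singular vectors, so that K⁻¹_ω0 starts as the identity:

```python
import numpy as np

def rcweb_column(V, tol=1e-8, max_iter=None):
    """Greedy search for one sparse column b_i with ||V @ b||_0 small.

    Returns (b, omega0), where omega0 indexes the entries of V @ b
    that have been driven to (numerical) zero.
    """
    N, P = V.shape
    omega0 = list(range(N))
    K_inv = np.eye(P)                 # (V[omega0].T @ V[omega0])^{-1}

    # Initialization: remove the row with the largest l1 norm.
    i_max = int(np.argmax(np.sum(np.abs(V), axis=1)))
    omega0.remove(i_max)
    K_inv = _downdate(K_inv, V[i_max])

    v_old = np.zeros(P)
    for _ in range(max_iter or N - P + 1):
        # Largest eigenpair of K_inv <-> smallest singular pair of V[omega0].
        lam, vecs = np.linalg.eigh(K_inv)
        lam_max, v = lam[-1], vecs[:, -1]
        if 1.0 / lam_max < tol or abs(abs(v @ v_old) - 1.0) < tol:
            return v, omega0          # converged: V[omega0] @ v ~ 0
        v_old = v
        # Left singular vector u = V[omega0] v / s, with s = 1/sqrt(lam_max).
        u = V[omega0] @ v * np.sqrt(lam_max)
        i_max = omega0[int(np.argmax(np.abs(u)))]
        omega0.remove(i_max)
        K_inv = _downdate(K_inv, V[i_max])
    return v, omega0

def _downdate(K_inv, row):
    """Sherman-Morrison update of (V.T V)^{-1} after deleting one row of V."""
    Kv = K_inv @ row
    return K_inv + np.outer(Kv, Kv) / (1.0 - row @ Kv)
```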

The above algorithm computes a single vector b_i, which is just one column of B. To find the other columns, RCweb applies the same approach to a modified (inflated) matrix, which for the ith column of B is V⁽ⁱ⁾ = V⁽ⁱ⁻¹⁾ + Vb_ib_iᵀ for i = 2, . . . , P. Thus, after the inference of each column of B, RCweb modifies V⁽ⁱ⁻¹⁾ to V⁽ⁱ⁾ (V⁽¹⁾ ≡ V) so that the algorithm will not replicate its choice of ω0. Note that in the ith update of V⁽ⁱ⁻¹⁾ the term Vb_ib_iᵀ modifies only the ω− rows of V⁽ⁱ⁻¹⁾, since the rows of Vb_ib_iᵀ whose indices are in ω0 contain only zero elements. Applying RCweb to the inflated matrices avoids inferring the same b_i multiple times, but a b_i inferred from the inflated matrix is generally going to differ at least slightly from the corresponding b_i that solves (4) for V. To avoid that, RCweb uses the inflated matrices only for the first few iterations, until the largest (by absolute magnitude) of the Pearson correlations between the smallest eigenvector of V_ω0 from the current (ith) iteration and the recovered columns of B is less than 1 − ε and monotonically decreasing; ε is chosen for numerical stability and also to reflect the similarity between the connectivity of R vertices that can be expected in the network whose topology is being recovered. A simpler alternative that works well in practice is to use the inflated matrix for the first k iterations, enough to find a new direction for b_i, after which RCweb switches back to V so that the solution is optimal for V. The switch requires k rank updates of K⁻¹_ω0 ≡ (V_ω0ᵀ V_ω0)⁻¹, and thus choosing k small saves computation. Choosing k too small, however, may not be enough to guarantee that b_i will not recapitulate a solution that is already found. Usually k = 10 works well and can easily be increased if the new solution is very close to an old one.
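A sketch of the multi-column loop (again my reconstruction, reusing `rcweb_column` from the previous sketch; for brevity it keeps the inflated matrix throughout, omitting the switch-back-to-V refinement described above):

```python
import numpy as np

def rcweb_all_columns(V, P):
    """Recover P columns of B by inflating V after each solved column."""
    Vi = V.copy()
    B = np.zeros((V.shape[1], P))
    for i in range(P):
        b, _ = rcweb_column(Vi)            # single-column routine from above
        B[:, i] = b
        Vi = Vi + np.outer(Vi @ b, b)      # inflate: V(i+1) = V(i) + (V b) b^T
    return B
```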

There are a few notable elements that make RCweb efficient. First, RCweb does almost all computations in R^P and, since P ≪ N and P < M, that saves both memory and CPU time. Second, each step requires only a few matrix-vector multiplications for computing the eigenvectors (since the change from the previous step is generally very small), and K⁻¹_ω0 is computed by a rank-one update, which obviates matrix inversion.

The approach that RCweb takes in solving (4) does not impose specific restrictions on the distribution of the observed variables (G), the noise in the data (Υ) or the latent variables R. However, the initial approximation of b_i with v_s can be poor for data arising from dense networks or special worst-case datasets. As demonstrated theoretically (Benaych-Georges 2009) and tested numerically in the next section, RCweb performs very well, at least in the absence of worst-case special structures in the data.

4 VALIDATION

To evaluate the performance of RCweb, I first apply it to data from simulated random bipartite networks with two different topology types, (1) Erdős-Rényi and (2) scale-free, whose corresponding degree distributions are (1) Poisson and (2) power-law. The network topology is encoded in a weighted adjacency matrix C_gold and the values for the unobserved variables are drawn from a standard uniform distribution. The simulations result in data matrices G ∈ R^{M×N} containing M observations of all N observed variables. According to RCweb, the optimal sparse adjacency matrix (C) and the hidden variables (R) can be inferred by the decomposition G ≈ RC, so that C is as sparse as possible while RC is as close as possible to G.
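For concreteness, simulated data of this kind can be generated as below (a hypothetical sampler of mine; the power-law exponent and weight distribution are arbitrary choices, not the paper's exact protocol):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_adjacency(P, N, mean_out_degree, topology="poisson"):
    """Sample a weighted adjacency matrix C_gold for a bipartite network."""
    if topology == "poisson":                 # Erdos-Renyi-like topology
        degrees = rng.poisson(mean_out_degree, size=P)
    else:                                     # scale-free: power-law degrees
        degrees = rng.zipf(2.0, size=P)
    degrees = np.clip(degrees, 1, N)
    C = np.zeros((P, N))
    for i, d in enumerate(degrees):
        targets = rng.choice(N, size=d, replace=False)
        C[i, targets] = rng.normal(size=d)    # edge weights
    return C

C_gold = sample_adjacency(P=30, N=500, mean_out_degree=50)
R = rng.random((2000, 30))                    # standard uniform hidden states
G = R @ C_gold + 0.1 * rng.normal(size=(2000, 500))
```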

In addition to RCweb, such a decomposition can be computed by three classes of existing algorithms. For comparison, I use the latest versions for which the authors report best performance: (A) PSMF for Bayesian matrix factorization, as implemented by the authors' MatLab function PSMF1 (Dueck 2005); (B) BFRM 2 for Bayesian matrix factorization, as implemented by the authors' compiled executable (Carvalho 2008); (C) emPCA for maximum likelihood estimate (MLE) sparse PCA (Sigg 2008); (D) K-SVD (Aharon 2005). All algorithms are run using the code published by their authors, with the default values of the parameters when parameters are required. The results are compared for various M, N, P, sparsity, and noise levels.

4.1 LIMITATIONS

Before comparing the results, consider some of the limitations common to all algorithms and the appropriate metrics for comparing the results. In the absence of any other information, the decomposition of G (no matter how accurate) cannot associate hidden variables (corresponding to columns of R) with physical factors. Furthermore, all methods can infer C and R only up to an arbitrary diagonal scaling or permutation matrix. First consider the scaling, illustrated by the following transformation by a diagonal matrix D: G = RC = RIC = R(DD⁻¹)C = (RD)(D⁻¹C) = R*C*. Such a transformation rescales R to R* and C to C*, which is just as sparse as C, ‖C*‖_0 = ‖C‖_0. Since both decompositions explain the variance in G equally well, RCweb (or any of the other methods) cannot distinguish between them. Thus, given C, there is a diagonal matrix D that scales C to C_gold, the weighted adjacency matrix of G.
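This scaling invariance is easy to check (a toy verification of mine):

```python
import numpy as np

rng = np.random.default_rng(3)
R = rng.random((100, 5))
C = rng.normal(size=(5, 200)) * (rng.random((5, 200)) > 0.8)

d = np.array([-2.0, 0.5, 3.0, -1.0, 4.0])   # diagonal of an invertible D
R_star, C_star = R * d, C / d[:, None]      # (RD) and (D^{-1} C)

assert np.allclose(R_star @ C_star, R @ C)               # same fit to G
assert np.count_nonzero(C_star) == np.count_nonzero(C)   # same ||C||_0
```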

The second limitation is that, in the absence of additional information, RCweb can determine C and R only up to a permutation matrix. Consider comparing C to the adjacency matrix used in the simulations, C_gold. Since the identity of the inferred hidden variables is not known, the rows of C do not generally correspond to the rows of C_gold; C_i (the ith row of C) is most likely to correspond to the C_gold row that is most correlated with C_i, and the Pearson correlation between the two rows quantifies the accuracy of the inference of C_i. To implement this idea, all rows of C and C_gold are first normalized to mean zero and unit variance, resulting in C_nor and C_nor^gold. The correlation matrix is then Σ = C_nor (C_nor^gold)ᵀ, and the most likely vertex (index of the unobserved variable) corresponding to C_i is k = arg max_j (|Σ_i1|, . . . , |Σ_ij|, . . . , |Σ_iP|), where k ∈ R. The absolute value is required because the diagonal elements of D can be negative. The accuracy is measured by the corresponding Pearson correlation, ρ_i = |Σ_ik|. An optimal solution of this matching problem can be found by using a belief propagation algorithm; for the simple case of a bipartite graph, even the LP-relaxed version guarantees an optimal solution (Sanghavi 2007). The overall accuracy is quantified by the mean correlation ρ = (1/p) Σ_{i=1}^{p} ρ_i, where p is the number of inferred unobserved variables, which may or may not equal P depending on whether the number of unobserved variables is known. In computing ρ, each row of C_gold is allowed to correspond to only one row of C and vice versa.
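A compact implementation of this accuracy metric (my sketch; it uses the Hungarian algorithm via scipy for the one-to-one matching, in place of the belief-propagation/LP matching cited above):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mean_matching_correlation(C, C_gold):
    """rho: mean |Pearson| correlation under the best one-to-one matching
    of inferred rows of C to rows of C_gold."""
    normalize = lambda X: (X - X.mean(1, keepdims=True)) / X.std(1, keepdims=True)
    Sigma = np.abs(normalize(C) @ normalize(C_gold).T) / C.shape[1]
    rows, cols = linear_sum_assignment(-Sigma)   # maximize total |correlation|
    return Sigma[rows, cols].mean()
```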

In addition to the two common limitations of permutation and scaling, some algorithms for sparse PCA (d'Aspremont 2007) require M > N; since this is not the case in many real-world problems and in some of the datasets simulated here, those methods are not tested. Instead I chose emPCA, which does not require M > N and is the latest MLE algorithm for sparse PCA that, according to its authors, is more efficient than previous algorithms (Sigg 2008).

4.2 ACCURACY AND COMPLEXITY SCALING

RCweb has a natural way of identifying the mean degree³. However, some of the other algorithms require the mean degree for optimal performance. To avoid underestimating an algorithm simply because it recovers networks that are too sparse or not sparse enough, I assume that the mean degree is known and input it to all algorithms. First all algorithms are tested on a very easy inference problem, Fig.1. Since PSMF and BFRM have lower accuracy and PSMF is significantly slower than the other algorithms, the rest of the results focus on the MLE algorithms, which also have better performance. PSMF gives less accurate results with power-law networks, which can be understood in terms of the uniform prior used by PSMF. PSMF has the advantage over the MLE algorithms of inferring a probabilistic network structure rather than a single estimate. A special advantage of BFRM is the seamless inclusion of response variables and measured factors in the inference. For the MLE algorithms, the accuracy of network inference increases with the ratio of observed to unobserved variables N/P (Fig.2) and with the number of observed configurations M, Fig.3. In contrast, as the noise in the data and the mean out-degree (mean number of edges from R to N vertices) are increased, the accuracy of the inference decreases.

³When RCweb learns all edges, the smallest singular value of V_ω0 approaches zero and its smallest singular vector converges to b_i.


Figure 1: Accuracy of network recovery as a function of the number of unobserved variables, P. Number of observed variables N = 500; number of observations M = 280. All networks are Poisson with mean out-degree 0.10N = 50, with 10% noise in the observations.

All algorithms perform better on Poisson networks, and the lower level of noise in the data from power-law networks was chosen to partially compensate for that. An important caveat when comparing the results for different algorithms is that K-SVD iteratively improves the accuracy of the solution, and thus the output depends on the maximum number of iterations allowed (I_max). For the results here, I_max = 20, and the accuracy of K-SVD may improve with a higher number of iterations, even though I did not observe significant improvement with I_max = 100. Even at 20 iterations K-SVD is significantly slower than RCweb and emPCA, Fig.4. The scaling of the algorithms with respect to a parameter was determined by holding all other parameters constant and regressing the log of the CPU time against the log of the variable parameter, Fig.4. The scaling with respect to some parameters is below the theoretical expectation since the highest-complexity steps may not be speed-limiting for the ranges of parameters used in the simulations.

4.3 INTERPRETABILITY

In analyzing real data the number of unobserved variables (P) may not be known. I observe that if a simulated network having P regulators is inferred assuming P* > P regulators, the elements of B corresponding to the excessive regulators are very close to zero, |B_ij| ≤ 10⁻¹⁰ for i > P or j > P. Thus, if the data truly originate from a sparse network, RCweb can discard unnecessary unobserved variables.

Figure 2: Accuracy of network recovery as a function of the number of unobserved variables, P. The thicker, brighter lines with squares correspond to number of observed variables N = 500 and the dashed, thinner lines correspond to N = 1000. In all cases the number of observations is M = 2000. A) Poisson networks with mean out-degree 0.25N {125 and 250}, with 50% noise in the observations, and B) power-law networks with mean out-degree 0.40N {200 and 400}, with 10% noise in the observations. C) The same as (B) except for a wider range of the observed variables.

The results of RCweb can be valuable even without identifying the physical factors corresponding to x_R. Yet identifying this correspondence, and thus overcoming the limitations outlined in section 4.1, can be very desirable. One practically relevant situation allowing in-depth interpretation of the results requires measuring (if only in a few configurations) the states of some of the variables that are generally unobserved (x_R). This is relevant, for example, to situations in which measuring some variables is much more expensive than others (such as protein modifications versus messenger RNA concentrations) and some x_{s∈R} can be observed only once or a few times. Assume that the states of the kth physical factor are measured (data in vector u_k) in n_k configurations, whose indices are in the set φ_k. This information can be enough to determine the vertex x_{s∈R} corresponding to the kth factor and the corresponding D_ss, as follows: 1) Compute the Pearson correlations ρ between u_k and the columns of R_φk; then the vertex of the inferred network most likely to correspond to the kth physical factor is s = arg max_i (|ρ_1|, . . . , |ρ_i|, . . . , |ρ_P|). 2) Compute D_ss = (R_φkᵀ R_φk)⁻¹ R_φkᵀ u_k.
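In code, the two steps look roughly as follows (an illustrative helper of my own; for simplicity, D_ss is computed by scalar least squares against the single matched column):

```python
import numpy as np

def match_factor(u_k, phi_k, R):
    """Match a sparsely measured factor u_k (observed in the configurations
    indexed by phi_k) to a column of R, then recover the scale D_ss."""
    Rp = R[phi_k]                                   # measured configurations only
    u = (u_k - u_k.mean()) / u_k.std()
    Z = (Rp - Rp.mean(0)) / Rp.std(0)
    rho = np.abs(u @ Z) / len(u_k)                  # |Pearson| with each column
    s = int(np.argmax(rho))                         # step 1: best-matching vertex
    D_ss = (Rp[:, s] @ u_k) / (Rp[:, s] @ Rp[:, s]) # step 2: least-squares scale
    return s, D_ss
```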


Figure 3: Accuracy of network recovery as a function of the number of observed configurations, M. Continuous lines & squares, N = 500; dashed lines & circles, N = 1000. A) Poisson networks with mean out-degree 0.20N {100 and 200}, with 50% noise; B) power-law networks with mean out-degree 0.40N {200 and 400}, with 10% noise. In all cases P = 30.

Figure 4: Computational efficiency (CPU time in seconds, log-log) as a function of P (panel A), N (panel B) and M (panel C). The scaling exponents (slopes in log-log space) reported in the legends are: A) RCweb 1.78, emPCA 1.03, kSVD 1.40; B) RCweb 1.41, emPCA 1.80, kSVD 0.98; C) RCweb −0.01, emPCA 1.33, kSVD 0.73. The networks have a power-law degree distribution and 60% sparsity, with mean out-degree 0.40N.

Similar to section 4.1, weighted matching algorithms (Sanghavi 2007) can be used to find the optimal solution if there is data for multiple x_{s∈R}.

Partial prior knowledge about the structure of G can also be used to enhance the interpretability of RCweb results. Assume, for example, that some of the nodes (x_{s∈αk⊂N}) regulated by the kth physical factor (which is a hidden variable in the inference) are known. Then the matching approach that was just outlined can be used with C rather than R. If the weights are not known, all non-zero elements of C_αk can be set to one. The significance of the overlap (fraction of common edges) between the regulator most likely to correspond to the kth physical factor and its known connectivity (coming from prior knowledge) can be quantified by a p-value (the probability of observing such overlap by chance alone) computed from the hypergeometric distribution. This approach is exemplified with gene-expression data in the next section.
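This overlap test is straightforward with scipy (a sketch; `set_inferred` and `set_chip` are hypothetical gene sets):

```python
from scipy.stats import hypergeom

def overlap_pvalue(n_genes, set_inferred, set_chip):
    """Probability of observing at least the actual overlap by chance alone,
    from the hypergeometric distribution over n_genes genes."""
    k = len(set_inferred & set_chip)   # observed overlap
    return hypergeom.sf(k - 1, n_genes, len(set_chip), len(set_inferred))
```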

4.4 INFERENCE OF A TRANSCRIPTIONAL NETWORK

Simulated models have the advantage of a known topology, and thus provide an excellent basis for rigorous evaluation. Such rigorous evaluation is not possible for real biological networks. Yet existing partial knowledge of transcriptional networks can be combined with the approach outlined in section 4.3 to evaluate the inference of biological networks. In particular, consider the transcriptional network of budding yeast, Saccharomyces cerevisiae. It can be modeled as a two-layer network with transcription factors (TFs) and their complexes being regulators (x_R) whose activities cannot be measured on a high-throughput scale because of technology limitations. TFs regulate the expression levels of messenger RNAs (mRNAs), whose concentrations can be measured easily on a high-throughput scale at multiple physiological states and represent the observed variables, x_N. Partial knowledge about the connectivity in this transcriptional network comes from ChIP-chip experiments (MacIsaac 2006), which identify sets of genes regulated by the same transcription factors (TFs).

I first apply RCweb to a set of mRNA configurations, followed by the approach of section 4.3 to evaluate the results. The data matrix G contains 423 yeast datasets measured on the Affymetrix S98 microarray platform at a variety of physiological conditions. RCweb was initialized with P* = 160 unobserved variables (corresponding to TFs) and identified P = 153 co-regulated sets of genes. The inferred adjacency matrix C is compared directly to the adjacency matrix identified by ChIP-chip experiments (MacIsaac 2006). Indeed, sets of genes inferred by RCweb to be co-regulated overlap substantially with sets of genes found to be regulated by the same TF in ChIP-chip studies, Fig.5. Furthermore, the most enriched gene ontology (GO) terms associated with the RCweb-inferred sets of genes correspond to the functions of the respective TFs. This fact provides further support for the accuracy of the inferred network topology and suggests that many of the genes inferred to be regulated by the corresponding TFs may be bona fide targets that were not detected in the ChIP-chip experiments because of the rather limited set of physiological conditions in which the experiments are performed (MacIsaac 2006).

Figure 5: Overlap between the first 7 sets of genes (column 1) that RCweb identifies as regulated by a common regulator and sets of genes whose promoters are bound by the same TF (column 2) in ChIP-chip experiments. The second and third columns list the number of overlapping genes between the RCweb set and the ChIP-chip set and the corresponding probabilities of observing such overlaps by chance alone. The fourth column contains a description of the TFs from the second column and the fifth column lists the most enriched GO terms for the RCweb gene sets.

Since the best available "gold standard" topology is only partially known, comparing the performance of the different algorithms in this setting can be hard to interpret; an algorithm can be penalized for correctly inferred edges if these edges are absent from the incomplete ChIP topology. An objective comparison requires experimental testing of predicted new edges, and that is the subject of a forthcoming paper.

The comparison between ChIP-chip gene sets and RCweb gene sets makes it possible to identify the likely correspondences between x_{s∈R} and TFs and to infer the activities of the TFs, which are very hard to measure experimentally but crucially important for understanding the dynamics and logic of gene regulatory networks (Slavov 2009).

5 CONCLUSIONS

This paper introduces an approach (RCweb) for inferring latent (unobserved) factors explaining the behavior of observed variables. RCweb aims at inferring a sparse bipartite graph in which vertices connect inferred latent factors (e.g. regulators of mRNA transcription and degradation) to observed variables (e.g. target mRNAs). The salient difference distinguishing RCweb from prior related work is a new approach to attaining a sparse solution, which allows the natural inclusion of a generative model and relaxation of assumptions on distributions, and ultimately results in more accurate and computationally efficient inference compared to competing algorithms for sparse data decomposition.


Acknowledgements

I thank Dmitry Malioutov and Rodolfo Ros Zertuche for many insightful discussions and extensive, constructive feedback, as well as David M. Blei and Kenneth A. Dawson for support and comments. This work was supported by the Irish Research Council for Science, Engineering and Technology (IRCSET) and by Science Foundation Ireland under Grant No. [SFI/SRC/B1155].

References

West M. (2002) Bayesian factor regression models in the large p, small n paradigm. Bayesian Statistics, 7, pp. 723–732

Carvalho C. et al. (2008) High-dimensional sparse factor modelling: Applications in gene expression genomics. JASA, 103, 484, pp. 1438–1456

Dueck D., Morris Q., Frey B. (2005) Multi-way clustering of microarray data using probabilistic sparse matrix factorization. Bioinformatics, 21, pp. 144–151

d'Aspremont A., Bach F., Ghaoui L. (2007) Optimal Solutions for Sparse Principal Component Analysis. JMLR, 9, pp. 1269–1294

Sigg C., Buhmann J. (2008) Expectation-Maximization for Sparse and Non-Negative PCA. ICML, Helsinki, Finland

Srebro N., Jaakkola T. (2001) Sparse Matrix Factorization for Analyzing Gene Expression Patterns. NIPS

Pe'er D., Regev A., Tanay A. (2002) Minreg: Inferring an active regulator set. Bioinformatics, 18, pp. 258–267

Bradley E., Trevor H., Iain J., Tibshirani R. (2004) Least Angle Regression. Annals of Statistics, 32, pp. 407–499

Aharon M., Elad M., Bruckstein A. (2005) K-SVD and its non-negative variant for dictionary design. Wavelets XI, 5914, pp. 591411–591424

Banerjee O., Ghaoui L. E., d'Aspremont A. (2007) Model selection through sparse maximum likelihood estimation. JMLR, 9, pp. 485–516

Cetin M., Malioutov D., Willsky A. (2002) A Variational Technique for Source Localization based on a Sparse Signal Reconstruction Perspective. ICASSP, 3, pp. 2965–2968, Orlando, Florida

Candes E., Wakin M., Boyd S. (2007) Enhancing sparsity by reweighted ℓ1 minimization. J. Fourier Anal. Appl., 14, pp. 877–905

Wilkinson J. H. (1988) The Algebraic Eigenvalue Problem. Oxford University Press, USA, ISBN-13: 978-0198534181

Benaych-Georges F., Rao R. (2009) The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. arXiv:0910.2120v1

Sanghavi S., Malioutov D., Willsky A. (2007) Linear programming analysis of loopy belief propagation for weighted matching. NIPS

MacIsaac K.D. et al. (2006) An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC Bioinformatics, 7:113. doi:10.1186/1471-2105-7-113

Slavov N., Dawson K.A. (2009) Correlation Signature of the Macroscopic States of the Gene Regulatory Network in Cancer. PNAS, 106, 11, pp. 4079–4084