
IEEE TRANSACTIONS ON NANOBIOSCIENCE, VOL. 9, NO. 2, JUNE 2010 121

Estimating Sparse Gene Regulatory Networks Using a Bayesian Linear Regression

Pinaki Sarder∗, Student Member, IEEE, William Schierding, J. Perren Cobb, and Arye Nehorai, Fellow, IEEE

Abstract—In this paper, we propose a gene regulatory network (GRN) estimation method for time-series microarray datasets that assumes such networks are typically sparse. We represent the regulatory relationships between the genes using weights, with the “net” regulatory influence on a gene’s expression being the sum of the independent regulatory inputs. We estimate the weights using a Bayesian linear regression method for sparse parameter vectors. We apply our proposed method to the genes selected by the extraction of differential gene expression (EDGE) software from a human buffy-coat microarray expression profile dataset of ventilator-associated pneumonia (VAP), and compare the estimation result with the GRNs estimated using both a correlation-coefficient method and a database-based method, ingenuity pathway analysis (IPA). A biological analysis of the consensus network derived from the GRNs estimated with our method and the correlation-coefficient method yields four biologically meaningful subnetworks. Also, our method performs either better than or competitively with the existing well-established GRN estimation methods. Moreover, it performs comparably with respect to: 1) the ground-truth GRNs for the in silico 50- and 100-gene datasets reported recently in the DREAM3 challenge and 2) the GRN estimated using a mutual-information-based method for the top-ranked genes of the VAP dataset selected by Bayesian analysis of time series (BATS, a user-friendly Bayesian software for analyzing time-series microarray experiments).

Index Terms—BANJO, Bayesian analysis of time series (BATS), Bayesian linear regression, correlation coefficient, DREAM, extraction of differential gene expression software (EDGE), gene regulatory network (GRN), ingenuity pathway analysis (IPA), mutual information, network identification by multiple regression (NIR) [time series network identification (TSNI)], sparsity, ventilator-associated pneumonia (VAP).

I. INTRODUCTION

ESTIMATING gene regulatory networks (GRNs) from microarray data using reverse engineering is an important problem in genomics [1]. We propose a GRN estimation method from time-series microarray data using a Bayesian approach,

Manuscript received September 30, 2009; revised February 1, 2010. Date of current version June 3, 2010. Asterisk indicates corresponding author.

∗P. Sarder is with the Department of Electrical and Systems Engineering, Washington University, St. Louis, MO 63130 USA (e-mail: [email protected]).

W. Schierding is with the Department of Genetics, Washington University, St. Louis, MO 63110 USA (e-mail: [email protected]).

J. P. Cobb is with the Center for Critical Illness and Health Engineering and the Departments of Surgery and Genetics, Washington University School of Medicine, St. Louis, MO 63110 USA (e-mail: [email protected]).

A. Nehorai is with the Department of Electrical and Systems Engineering, Washington University, St. Louis, MO 63130 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNB.2010.2043444

considering the regulatory interactions between the genes in the network to be sparse. Our approach is motivated by the insight provided by earlier studies on GRNs, demonstrating that such networks in most biological systems are typically sparse [2]–[5].

In our study, we use a precise physical model to describe the interactions among the genes; in particular, we represent the putative regulatory relationship between the genes using sparse linear coefficients or weights [6]. We estimate the weights by employing a Bayesian linear regression method for sparse parameter vectors [7].

There exist several methods to model and estimate the GRN from a microarray dataset; for example, GRN estimation using a genetic algorithm [8], a neural network [9], and Bayesian models [10] are well-known classical methods. In another intriguing early approach, Vogelstein et al. [11] and Marcotte [12] model the GRN using patterns similar to complex webs or electrical circuits. Yeung et al. [13] propose linear models to describe the gene interactions, and use singular value decomposition to estimate the GRN. In a similar study, di Bernardo et al. [14] develop an expectation–maximization algorithm to estimate the GRN. Currently, the three most well-established approaches in the community are [15]: 1) an ordinary differential-equation-based method, known variously as network identification by multiple regression (NIR) or time series network identification (TSNI) [16], [17]; 2) a dynamic Bayesian-network-based method, known as BANJO [18]; and 3) information-theoretic approaches [19]–[21]. In a recent study, Cantone et al. [15] build a synthetic network called In-vivo Reverse-engineering and Modelling Assessment (IRMA), consisting of five genes, for assessing the performance of the existing GRN estimation approaches. They conclude that NIR (TSNI) and BANJO are able to correctly infer the regulatory interactions from the time-series experimental data of their implemented IRMA network.

We, therefore, compare the performance of our proposed method with NIR (TSNI) and BANJO using the two published time-series datasets in [15]. Our method outperforms the other two methods in one dataset, and shows a performance competitive with them in the other. The performance comparison reveals two immediate advantages of our method over the other two methods. First, our method is simple and automatic, and it does not require additional experiments to estimate the model parameters; in contrast, NIR (TSNI) requires additional experiments to estimate the complex kinetics parameters involved in its related model for estimation [15]. Second, the performance of our method should not be affected by the number of time points in the time-series microarray data that we analyze; in contrast, improving the performance of BANJO requires collection of additional time-series measurements. Note

1536-1241/$26.00 © 2010 IEEE


that IRMA is a small network. We are, however, well grounded in applying our method to large microarray datasets, since the algorithms that perform well with small networks also perform well with larger networks [15], [20], [22].

We further evaluate the performance of our proposed method by applying it to the in silico time-series datasets, which consist of 50 and 100 genes, reported recently in the DREAM3 challenge [23], [24]. These datasets represent gene interactions under several external perturbations, and our method performs significantly, i.e., better than a random assignment of edges, with respect to their ground-truth GRNs.

We apply our proposed method to the genes that the extraction of differential gene expression software (EDGE) [25] selects from a human circulating white blood cell (“buffy-coat”) microarray dataset of ventilator-associated pneumonia (VAP) [26]. We analyze the common VAP network edges that we estimate using our method, and that McDunn et al. [26] estimate using a correlation-coefficient-based method and a commercial database-based method, ingenuity pathway analysis (IPA) [27]. We hypothesize here that the common edges, based on a consensus between the networks that our proposed and the correlation-coefficient-based methods predict, and with significantly high absolute weights that our proposed method estimates, are significant. However, to our knowledge, the two computational networks, the consensus network, and the IPA are each equally valid. There are definite advantages to supplementing results from the consensus network with the analytic capabilities of IPA. By doing so, we obtain four biologically meaningful subnetworks from the consensus network. We then further apply our method to the top-ranked BATS [28] selected genes of the VAP dataset. Our method here performs significantly with respect to the GRN estimated using a mutual-information-based method [21], and thus reconfirms its own validity.

The paper is organized as follows. In Section II, we first discuss the biophysical model [6] and our proposed biostatistical model to describe time-series gene microarray data. We then briefly discuss our strategy to estimate the GRN. In Section III, we illustrate the results obtained in comparing our proposed method with NIR (TSNI) and BANJO. In Section IV, we present the results obtained by applying our method to the DREAM3 in silico challenge data. In Section V, we present an analysis of the VAP dataset using our proposed method. Finally, we conclude in Section VI.

II. ESTIMATION

In this section, we first briefly review the biophysical model presented in [6], and then propose a corresponding biostatistical model to describe time-series gene microarray data. We then discuss our strategy to estimate the GRN.

A. Biophysical Model

We denote the measured expression of the $k$th gene of the $i$th patient or experiment at the time-point $t_j$ as $x_k^i(t_j)$, where $k \in \{1, 2, \ldots, G\}$, $i \in \{1, 2, \ldots, P\}$, and $j \in \{1, 2, \ldots, N\}$. Here, the measurements from different patients or experiments are independent of each other. We then denote the total regulatory input to the $k$th gene by all other genes as $r_k^i(t_j)$. In vector form,

Fig. 1. Example of the dose-response function for the $k$th gene with $\alpha_k = 1$ and $\beta_k = 0$, where $r_k^i(t_j)$ is the input and $s_k^i(t_{j+1})$ is the output.

we define the gene expression for the $i$th patient or experiment at time-point $t_j$ as $x^i(t_j) = [x_1^i(t_j), x_2^i(t_j), \ldots, x_G^i(t_j)]^T$ and the total regulatory input as $r^i(t_j) = [r_1^i(t_j), r_2^i(t_j), \ldots, r_G^i(t_j)]^T$, where $T$ denotes the matrix transpose operation.

We denote the gene regulation matrix as $W$ of size $G \times G$, with elements $w_{kl}$ for $k, l \in \{1, 2, \ldots, G\}$. The net regulatory effect of the $l$th gene on the $k$th gene at time $t$ is simply the expression level $x_l^i(t)$ of the $l$th gene multiplied by its regulatory influence $w_{kl}$ on the $k$th gene. The total regulatory input $r_k^i(t)$ to the $k$th gene is derived by summing across all the genes in the system [6]:

$$r_k^i(t) = \sum_l w_{kl}\, x_l^i(t). \tag{1}$$

Hence, we have

$$r^i(t) = W x^i(t). \tag{2}$$
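As a minimal sketch of (1) and (2), the total regulatory input is a matrix-vector product. The three-gene weight matrix and expression vector below are hypothetical values chosen for illustration only; the function name `regulatory_input` is likewise an assumption of this sketch.

```python
def regulatory_input(W, x):
    """Return r = W x, i.e. r[k] = sum_l W[k][l] * x[l], as in (1)-(2)."""
    return [sum(w_kl * x_l for w_kl, x_l in zip(row, x)) for row in W]


# Hypothetical 3-gene system (illustrative numbers, not from the paper):
# gene 0 is stimulated by gene 1 and repressed by gene 2,
# gene 1 has no regulatory inputs, gene 2 is stimulated by gene 0.
W = [
    [0.0, 0.5, -1.0],
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
]
x = [1.0, 4.0, 0.5]  # expression levels at one time-point
r = regulatory_input(W, x)
```

Each row of `W` collects the inputs to one gene, so a zero row corresponds to a gene that no other gene regulates.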

A positive or negative value of $w_{kl}$ models the $l$th gene stimulating or repressing the expression of the $k$th gene, while a zero value indicates that the $l$th gene does not influence the transcription of the $k$th gene. Thus, in addition to zero values, each gene has multiple inputs, both positive and negative, of differing strengths.

The response of each gene to the regulatory input $r_k^i(t)$ is given by a dose-response function [6]

$$s_k^i(t_{j+1}) = \frac{1}{1 + \exp[-(\alpha_k r_k^i(t_j) + \beta_k)]}, \qquad 0 < s_k^i(t_{j+1}) < 1 \tag{3}$$

where $\alpha_k$ and $\beta_k$ are two gene-specific constants defining the dose-response function and $s_k^i(t_{j+1})$ denotes the relative expression level of the $k$th gene at time $t_{j+1}$, see Fig. 1. Briefly, each gene has a static dose-dependent response to activating and repressing regulatory influences. The parameter $\alpha_k$ can be any positive real number that defines the slope of the curve at its inflection point. Genes with a large $\alpha_k$ will shift rapidly from


Fig. 2. Biophysical model (shown in the dotted box) fitting between the observations $x^i(t_j)$ and $x^i(t_{j+1})$ [6].

near zero expression to near maximal expression when the activating input surpasses a gene-specific threshold, while those with a small $\alpha_k$ will have a nearly linear response over the biologically relevant range of regulatory inputs. The $\beta_k$ constant can be any real number, and it represents the gene’s basal level of expression. A positive $\beta_k$ represents a gene with a high basal level of transcription, and a negative $\beta_k$ represents the converse.
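The roles of $\alpha_k$ and $\beta_k$ in (3) can be checked numerically. The sketch below is illustrative only: the function name `dose_response` and all parameter values are assumptions chosen to contrast a steep curve with a nearly linear one.

```python
import math


def dose_response(r, alpha, beta):
    """Relative expression s in (0, 1) for regulatory input r, following (3)."""
    return 1.0 / (1.0 + math.exp(-(alpha * r + beta)))


# Inflection point of the curve in Fig. 1 (alpha = 1, beta = 0).
s_mid = dose_response(0.0, alpha=1.0, beta=0.0)

# A large alpha switches between near-zero and near-maximal expression.
steep = [dose_response(r, alpha=20.0, beta=0.0) for r in (-0.5, 0.5)]

# A small alpha stays close to the midpoint over the same input range.
shallow = [dose_response(r, alpha=0.1, beta=0.0) for r in (-0.5, 0.5)]

# A positive beta raises the basal level (the response at zero input).
high_basal = dose_response(0.0, alpha=1.0, beta=2.0)
```

With zero input the response depends only on $\beta_k$, which is why $\beta_k$ can be read as a basal expression level.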

We convert the relative expression level $s_k^i(t_{j+1})$ to the actual expression level for the $k$th gene by multiplying it with its maximal expression $m_k$ ($\forall k \in \{1, 2, \ldots, G\}$). Hence, the measured expression level $x_k^i(t_{j+1})$ of the $k$th gene at time $t_{j+1}$ is approximately equivalent to $m_k s_k^i(t_{j+1})$. This is shown in Fig. 2, where $m = [m_1, m_2, \ldots, m_G]^T$. Thus, for the $k$th gene of the $i$th patient or experiment at time $t_{j+1}$, we have

$$-\ln\left[\frac{m_k}{x_k^i(t_{j+1})} - 1\right] \approx \alpha_k \sum_l w_{kl}\, x_l^i(t_j) + \beta_k \;\Rightarrow\; y_k^i(t_{j+1}) \approx \alpha_k \sum_l w_{kl}\, x_l^i(t_j) + \beta_k \tag{4}$$

where we define $y_k^i(t_{j+1}) = -\ln\left[\frac{m_k}{x_k^i(t_{j+1})} - 1\right]$.
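The transform in (4) is the inverse of the dose-response function (3) scaled by $m_k$: from $x = m_k / (1 + e^{-z})$ it follows that $m_k / x - 1 = e^{-z}$, so $-\ln[m_k / x - 1] = z$. A minimal round-trip check, with hypothetical function names and values, could look as follows.

```python
import math


def to_linear_response(x, m):
    """y = -ln(m / x - 1), defined for 0 < x < m, as in (4)."""
    return -math.log(m / x - 1.0)


def from_linear_response(y, m):
    """Inverse map back to an expression level: x = m / (1 + exp(-y))."""
    return m / (1.0 + math.exp(-y))


m_k = 10.0    # hypothetical maximal expression of gene k
x_obs = 2.5   # hypothetical measured expression, strictly below m_k
y = to_linear_response(x_obs, m_k)      # equals -ln(3)
x_back = from_linear_response(y, m_k)   # recovers x_obs
```

The transform is only defined for expression levels strictly between 0 and $m_k$, which matters when $m_k$ is chosen from the data.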

B. Biostatistical Model

We propose a biostatistical model related to the biophysical model described in Section II-A. The dose-response curve imperfectly models the biological response of a gene, and hence, the model fitting in (4) incurs some error, which we consider to be biological noise. We introduce this error while fitting $\alpha_k \sum_l w_{kl}\, x_l^i(t_j) + \beta_k$ to $y_k^i(t_{j+1})$ in (4). Thus, we have

$$y_k^i(t_{j+1}) = \alpha_k \sum_l w_{kl}\, x_l^i(t_j) + \beta_k + e_k^i(t_{j+1}) \tag{5}$$

where $e_k^i(t_{j+1})$ is the model-fitting error, distributed as an additive Gaussian random variable with mean zero and variance $\sigma_e^2$, independently and identically distributed (i.i.d.) from measurement to measurement for $k \in \{1, 2, \ldots, G\}$, $i \in \{1, 2, \ldots, P\}$, and $j \in \{1, 2, \ldots, N - 1\}$. Thus, for the $i$th patient or experiment, we have from (5)

$$y^i(t_{j+1}) = W' x^i(t_j) + \beta + e^i(t_{j+1}) \tag{6}$$

where $y^i(t_j) = [y_1^i(t_j), y_2^i(t_j), \ldots, y_G^i(t_j)]^T$, $W'$ is a $G \times G$ matrix with elements $w'_{kl} = \alpha_k w_{kl}$ for $k, l \in \{1, 2, \ldots, G\}$, $\beta = [\beta_1, \beta_2, \ldots, \beta_G]^T$, and $e^i(t_{j+1}) = [e_1^i(t_{j+1}), e_2^i(t_{j+1}), \ldots, e_G^i(t_{j+1})]^T$.

We express the $k$th ($\forall k \in \{1, 2, \ldots, G\}$) entry of the vector $y^i(t_{j+1})$ in (6) as follows:

$$y_k^i(t_{j+1}) = \begin{bmatrix} x_1^i(t_j) & x_2^i(t_j) & \cdots & x_G^i(t_j) & 1 \end{bmatrix} \begin{bmatrix} w'_{k1} \\ w'_{k2} \\ \vdots \\ w'_{kG} \\ \beta_k \end{bmatrix} + e_k^i(t_{j+1}). \tag{7}$$

We then stack $y_k^i(t_{j+1})$ for the $i$th patient or experiment ($\forall i \in \{1, 2, \ldots, P\}$) at time-points $t \in \{t_1, t_2, \ldots, t_N\}$, and obtain

$$y_k^i = X^i h_k + e_k^i \tag{8}$$

where $y_k^i = [y_k^i(t_2), y_k^i(t_3), \ldots, y_k^i(t_N)]^T$,

$$X^i = \begin{bmatrix} x_1^i(t_1) & x_2^i(t_1) & \cdots & x_G^i(t_1) & 1 \\ x_1^i(t_2) & x_2^i(t_2) & \cdots & x_G^i(t_2) & 1 \\ \vdots & \vdots & \cdots & \vdots & \vdots \\ x_1^i(t_{N-1}) & x_2^i(t_{N-1}) & \cdots & x_G^i(t_{N-1}) & 1 \end{bmatrix},$$

$h_k = [w'_{k1}, w'_{k2}, \ldots, w'_{kG}, \beta_k]^T$, and $e_k^i = [e_k^i(t_2), e_k^i(t_3), \ldots, e_k^i(t_N)]^T$. We further stack $y_k^i$ for all patients or experiments $i \in \{1, 2, \ldots, P\}$, and obtain

$$\begin{bmatrix} y_k^1 \\ y_k^2 \\ \vdots \\ y_k^P \end{bmatrix} = \begin{bmatrix} X^1 \\ X^2 \\ \vdots \\ X^P \end{bmatrix} h_k + \begin{bmatrix} e_k^1 \\ e_k^2 \\ \vdots \\ e_k^P \end{bmatrix} \;\Rightarrow\; y_k = X h_k + e_k \tag{9}$$

with $y_k$ and $e_k$ being vectors of size $P(N - 1)$, $h_k$ a vector of size $(G + 1)$, and $X$ a matrix of size $P(N - 1) \times (G + 1)$. We assume in our analysis that the number of patients or experiments $P$ and the number of time-points $N$ are such that $P(N - 1) > (G + 1)$.
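The stacking in (7)-(9) can be sketched as follows. The nested-list data layout `x[i][j][l]`, the function name `build_regression_data`, and the numbers are assumptions made for illustration; the construction pairs each expression vector at $t_j$ with the transformed expression of gene $k$ at $t_{j+1}$.

```python
import math


def build_regression_data(x, k, m_k):
    """Build X and y_k of (9) for gene k.

    x[i][j][l] is the expression of gene l for patient i at time index j
    (j = 0, ..., N-1). Each row of X is [x_1(t_j), ..., x_G(t_j), 1] and
    the matching entry of y_k is -ln(m_k / x_k(t_{j+1}) - 1).
    """
    X, y_k = [], []
    for series in x:                      # one patient or experiment
        for j in range(len(series) - 1):  # pairs (t_j, t_{j+1})
            X.append(list(series[j]) + [1.0])
            y_k.append(-math.log(m_k / series[j + 1][k] - 1.0))
    return X, y_k


# Hypothetical data: P = 2 patients, N = 3 time-points, G = 2 genes.
x = [
    [[1.0, 2.0], [1.5, 2.5], [2.0, 3.0]],
    [[0.5, 1.0], [1.0, 1.5], [1.5, 2.0]],
]
X, y0 = build_regression_data(x, k=0, m_k=4.0)
```

Here $X$ has $P(N - 1) = 4$ rows and $G + 1 = 3$ columns, so the toy data satisfy the assumption $P(N - 1) > (G + 1)$. The same $X$ is shared by all genes; only $y_k$ changes with $k$.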


C. Estimating the GRN

We aim to estimate the GRN, i.e., the unknown gene regulation weight matrix $W$, using the linear model (9) for each $k = 1, 2, \ldots, G$. In addition, we also aim to estimate the unknown parameters $\alpha_k$ and $\beta_k$ of the $k$th ($\forall k \in \{1, 2, \ldots, G\}$) gene, which are nuisance parameters. Here, recall from the previous section that $\alpha_k$ and the elements $w_{kl}$ ($\forall k, l \in \{1, 2, \ldots, G\}$) of $W$ are not separable from each other for a given $k$, and thus, we estimate them together as $w'_{kl} = \alpha_k w_{kl}$ without loss of generality. Therefore, the unknown parameters we estimate in our analysis are $w'_{kl}$ ($\forall k, l \in \{1, 2, \ldots, G\}$) and $\beta_k$ ($\forall k \in \{1, 2, \ldots, G\}$); note that these parameters are the elements of the vectors $h_k$ ($\forall k \in \{1, 2, \ldots, G\}$) in (9).

Hence, we estimate $h_k$ from (9) using the Bayesian linear regression method proposed in [7] for sparse parameter vectors. Here, we assume that $h_k$ ($\forall k \in \{1, 2, \ldots, G\}$) are sparse vectors with the locations of the zeros unknown. This assumption is motivated by earlier studies [2]–[5], which concluded that GRNs in most biological systems are sparse. Note that we describe the GRN using $w_{kl}$ ($\forall k, l \in \{1, 2, \ldots, G\}$), which, in turn, are related to the elements of the vector $h_k$ ($\forall k \in \{1, 2, \ldots, G\}$) through $w'_{kl} = \alpha_k w_{kl}$, see Section II-B, and so, we assume that $h_k$ ($\forall k \in \{1, 2, \ldots, G\}$) are sparse.
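The details of the estimator in [7] are not given in this section, so the sketch below does not reproduce it. As a stand-in that illustrates what a sparse estimate of $h_k$ from (9) looks like, it uses a different, well-known technique: the maximum a posteriori estimate under a Laplace prior on the weights, which coincides with the lasso, solved by cyclic coordinate descent. The function names, the penalty value `lam`, and the data are all assumptions of this sketch.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t (proximal map of the L1 penalty)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=2000):
    """Minimise 0.5 * ||y - X h||^2 + lam * sum_c |h_c| over the weights.

    The last column of X is the all-ones column for beta_k and is left
    unpenalised. This is a stand-in, NOT the method of [7].
    """
    n, p = len(X), len(X[0])
    h = [0.0] * p
    col_sq = [sum(X[i][c] ** 2 for i in range(n)) for c in range(p)]
    for _ in range(n_iter):
        for c in range(p):
            if col_sq[c] == 0.0:
                continue
            # Inner product of column c with the residual excluding h[c].
            rho = sum(
                X[i][c] * (y[i] - sum(X[i][d] * h[d] for d in range(p) if d != c))
                for i in range(n)
            )
            penalty = 0.0 if c == p - 1 else lam
            h[c] = soft_threshold(rho, penalty) / col_sq[c]
    return h


# Hypothetical noise-free data: gene k responds to gene 1 only,
# y = 0.8 * x_1 - 1.0; the columns of X are [x_1, x_2, x_3, 1].
X = [
    [1.0, 0.2, 0.5, 1.0],
    [2.0, 0.9, 0.1, 1.0],
    [3.0, 0.4, 0.8, 1.0],
    [4.0, 0.7, 0.3, 1.0],
    [5.0, 0.1, 0.6, 1.0],
    [6.0, 0.6, 0.9, 1.0],
]
y = [0.8 * row[0] - 1.0 for row in X]
h = lasso_coordinate_descent(X, y, lam=0.05)
```

In this toy case the weights of the two irrelevant genes are driven to zero while the weight of the true regulator is only slightly shrunk, which is the qualitative behaviour a sparse estimator is meant to deliver.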

III. PERFORMANCE COMPARISON WITH EXISTING APPROACHES

In this section, we compare the performance of our proposed GRN estimation method with the NIR (TSNI) [16], [17] and BANJO [18] methods, the two most well-established methods for estimating GRNs from time-series microarray data. Here, we employ the two time-series datasets of the IRMA network, recently published in [15], which was implemented for assessing the performance of the existing GRN estimation methods. Cantone et al. [15] conclude that the GRN estimation methods NIR (TSNI) and BANJO are able to correctly estimate the regulatory interactions from their collected time-series datasets. Therefore, we assess the performance of our method by comparing it with these two methods.

As shown in Fig. 3, the IRMA network is implemented for the yeast Saccharomyces cerevisiae, considering five genes (SWI5, ASH1, CBF1, GAL4, and GAL80) regulating each other through a variety of interactions, without being affected by endogenous genes. The two time-series datasets in several independent experiments were collected by “switching” genes ON or OFF by culturing the yeast cells in a galactose medium or a glucose medium in 14 uniformly spaced time-points, respectively [15]. Cantone et al. [15] name these datasets “switch on” and “switch off.” They average these datasets over the independent experiments [15], and we use them to estimate the GRN with our proposed method. See [15] for a detailed description, and for brief reviews on the two GRN estimation methods NIR (TSNI) and BANJO.

Here, in estimating the GRN using our proposed method, we use $\Delta C_t$ values of the published datasets in [15] as the gene

Fig. 3. Ground-truth IRMA network. Red and blue lines signify stimulation and repression, respectively, from one gene to the other in the connecting edges.

expression levels, and compute $m_k$ for the $k$th gene in (4) as

$$m_k = \max_{i,j}\, x_k^i(t_j). \tag{10}$$
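A minimal sketch of (10) follows. One numerical point is worth noting: if $m_k$ equals the largest observed value exactly, the sample attaining it gives $m_k / x - 1 = 0$ in (4), whose logarithm is undefined. How this sample is handled is not stated in this section; the small `margin` below is purely an assumption of this sketch, as are the function name and the data.

```python
def max_expression(x, k, margin=1e-3):
    """m_k = max over patients i and time-points j of x[i][j][k], as in (10).

    `margin` is an assumption of this sketch, not part of the paper:
    inflating the maximum slightly keeps m_k / x - 1 strictly positive
    for every sample, so the logarithm in (4) stays finite.
    """
    peak = max(sample[k] for series in x for sample in series)
    return peak * (1.0 + margin)


# Hypothetical data: x[i][j][l] with 2 patients, 3 time-points, 2 genes.
x = [
    [[1.0, 2.0], [1.5, 2.5], [2.0, 3.0]],
    [[0.5, 1.0], [1.0, 1.5], [1.5, 2.0]],
]
m0 = max_expression(x, k=0)
m1 = max_expression(x, k=1)
```

Setting `margin=0.0` recovers (10) exactly.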

In Fig. 4, we present the GRN estimation results using the switch-on and switch-off time-series datasets for the ground-truth network shown in Fig. 3. Fig. 4(a) and (c) shows the GRN estimation results for the switch-on time-series dataset using NIR (TSNI) and our proposed algorithm, respectively. Similarly, in Fig. 4(b) and (d), we present the GRN estimation results for the switch-off time-series dataset using NIR (TSNI) and our proposed algorithm, respectively.

We quantify the performance in terms of the ratio of the correctly predicted edges to the total number of predicted edges [i.e., positive predictive value (PPV)], and in terms of the ratio of the true edges that have been correctly identified by the employed algorithm to all the true edges [i.e., sensitivity (Se)] [20].

We test the significance of the employed algorithm by comparing its PPV with the PPV of a “random” algorithm that randomly assigns edges between pairs of genes in the estimated network. For example, for a fully connected network, the random algorithm would have a PPV = 1 for all sensitivity levels, as any pair of genes is connected in the network. In the network shown in Fig. 3, the expected PPV for a random guess of directed edges among genes is PPV = 0.4, so we consider any value higher than 0.4 to be significant.
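The two metrics and the random baseline can be sketched as below. Edges are represented as (source, target) pairs; the toy edge sets are hypothetical. The baseline assumes that self-loops are excluded from the possible directed edges, and the count of 8 true edges among 5 genes is an illustrative assumption chosen because it reproduces a value of 0.4, not a figure taken from [15].

```python
def ppv_and_sensitivity(predicted, truth):
    """PPV = TP / |predicted| and Se = TP / |truth| for sets of directed edges."""
    true_positives = len(predicted & truth)
    ppv = true_positives / len(predicted) if predicted else 0.0
    se = true_positives / len(truth) if truth else 0.0
    return ppv, se


def random_ppv(n_true_edges, n_genes):
    """Expected PPV of random guessing: true edges / possible directed edges.

    Assumes self-loops are excluded, giving n_genes * (n_genes - 1) candidates.
    """
    return n_true_edges / (n_genes * (n_genes - 1))


# Hypothetical ground truth and prediction on three genes A, B, C.
truth = {("A", "B"), ("B", "C"), ("C", "A"), ("A", "C")}
predicted = {("A", "B"), ("B", "A"), ("C", "A")}
ppv, se = ppv_and_sensitivity(predicted, truth)

# Illustrative baseline: 8 true edges among 5 genes (assumed counts).
baseline = random_ppv(8, 5)
```

Because the edges are directed, a prediction of B to A does not count as a hit for the true edge A to B.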

We outperform the algorithms NIR (TSNI) and BANJO for the switch-off time-series dataset, see Fig. 4(b) and (d); BANJO shows similar performance to NIR (TSNI) [15]. For the switch-on time-series dataset, we outperform BANJO, and obtain performance competitive with NIR (TSNI). That is to say, NIR (TSNI) performs worse than our proposed method in terms of Se, and it performs better than our method in terms of PPV; the BANJO algorithm fails to perform better than the random algorithm [15]. Also note that we consider in the performance comparison only the accuracy in estimating the directed edges using our method, and not the accuracy in estimating whether the directed edges are stimulating or repressing.

During the performance comparison, we observe two immediate advantages of our proposed method in comparison with the other two methods. First, our method is simple and automatic, and it does not require additional experiments to estimate


Fig. 4. Estimated GRN using NIR (TSNI) for: (a) the switch-on dataset and (b) the switch-off dataset. Estimated GRN using our proposed method for: (c) the switch-on dataset and (d) the switch-off dataset. Red and blue lines signify stimulation and repression, respectively, from one gene to the other in the connecting edges. The dashed lines signify a falsely estimated connection, see the ground truth shown in Fig. 3.

the model parameters; in contrast, NIR (TSNI) requires additional experiments to estimate the complex kinetics parameters involved in its related model for estimation [15]. Second, the performance of our method should not be affected by the number of time-points in the time-series microarray data that we analyze as long as $P(N - 1) > (G + 1)$ and $\sigma_e^2$ is negligible in (5), see Section II; in contrast, improving the performance of BANJO requires collection of additional time-series measurements. We are also well grounded in applying our method to microarray datasets larger than IRMA, since the algorithms that perform well with small networks, such as IRMA, also perform well with larger networks [15], [20], [22], see also Section IV.

IV. DREAM3 In Silico CHALLENGE NETWORK ANALYSIS

We apply our proposed method to the in silico time-series datasets, consisting of 50 and 100 genes, reported recently in the DREAM3 challenge [23], [24]. Here, we employ: 1) the five datasets for the 50-gene networks and 2) the five datasets for the 100-gene networks, both described in the DREAM3 challenge. Each dataset represents the specific gene interactions of Escherichia coli or yeast under several external perturbations or in several independent experiments. The datasets were collected in 23 independent experiments for the 50-gene networks and in 46 independent experiments for the 100-gene networks. We compare the directed edges of our estimated GRNs with the ground-truth GRNs for each of these datasets, and obtain for each of them a higher value of the PPV using our method than we obtain using the random algorithm. Recall from Section III that if a method results in a higher value of the PPV than is obtained using the random algorithm, then that method is significant. We thus conclude that our method performs significantly with respect to the ground-truth GRNs for the DREAM3 in silico challenge datasets, see Tables I and II.

We summarize two key observations from our simulations for the results presented in Tables I and II. First, we find that the performance of our proposed method improves as we increase the number of independent experiments in our simulations, i.e., with increasing $P$ in (9). We thus understand that the in silico datasets that we use here are very noisy. Second, we estimate many false edges for all of the in silico datasets. This is to be expected when the datasets we employ are very noisy. Note that our method is purely computational, and requires prior knowledge of the sparsity and the gene-expression variation [7]. We


TABLE I
PERFORMANCE FOR THE 50-GENE NETWORK

TABLE II
PERFORMANCE FOR THE 100-GENE NETWORK

currently replace these parameters with their empirical Bayesian estimates obtained from the data [7], and such estimates are often very inaccurate for noisy data. One possible way to improve the accuracy in estimating these parameters is to use biological knowledge about them, which would, in turn, improve the performance of our method.

V. VAP DATA ANALYSIS

VAP refers to a pneumonia that develops in patients who require mechanical ventilation through an endotracheal or tracheostomy tube. This disease is a common, severe, and costly complication that increases the likelihood of death from infection [29].

A. Experiment

Circulating white blood cell microarray expression profiles were generated from 20 mechanically ventilated patients every two days for up to three weeks [26]. Eleven of these patients developed VAP, as diagnosed by the attending physician [26]. We present the data collection days of these patients in Table III, with day zero being the day that a patient was diagnosed as having VAP. Using EDGE software, 85 mRNA species were identified [25], with a false discovery rate of 0.1, whose abundance changed concordantly among the patients during the data collection window surrounding the day zero data [26]. We also identify 50 mRNA species from the average of the microarray expression profiles, which appear at the top of the ordered list, when the genes are ranked based on their differential expressions using a gene selection software BATS (a Bayesian user-friendly software for analyzing time-series microarray experiments) [28]. We use, in this ranking, the default parameters that are set in the BATS analysis window.

TABLE III
DATA COLLECTION DAYS OF THE 11 VAP PATIENTS

Fig. 5. Venn diagram showing the number of common edges in any combination of the three methods, i.e., our proposed, and the correlation-coefficient- and IPA-based methods, and the number of edges that appear in only a single method, but not in the other ones.

B. Analysis Using the 85 EDGE Selected Genes

We estimate the GRN for the 85 EDGE selected genes using our proposed method, and compare the resulting directed edges with those that McDunn et al. [26] estimate using both a correlation-coefficient-based method [19] and the IPA-based method [27]. (IPA is a knowledge-based software that estimates the GRN based on the gene–gene interactions reported in the literature.) The goal of our comparison is to quantify the similarity or difference among the networks that the three methods (i.e., our method, the correlation-coefficient-based method, and the IPA-based method) derive. We expect that the common edges would be the most likely to be biologically significant. Note that we can perform a similar analysis using the 50 BATS selected genes. We, however, do not include such analysis here for brevity.

1) Analysis–1: Our proposed method estimates 4256 network edges with nonzero weights, out of a possible 7225 network edges. The correlation-coefficient- and IPA-based methods estimate 613 and 44 network edges, respectively. In Fig. 5, we present a Venn diagram to show the number of common edges in any combination of these three methods and the number of edges that appear in only a single method. We see that all the three


Fig. 6. Common GRN estimated using our proposed and the correlation-coefficient-based methods. Red and blue lines signify stimulation and repression, respectively, from one gene to the other in the connecting edges.

methods overlap. Our method and the correlation-coefficient-based method exhibit 321 common network edges.

There is only one edge, between CCR1 and IL-1 beta, common to all the three estimated networks. This edge is well documented in the literature [27], [30]. However, we estimate this edge with a weight near zero using our proposed method, and hence, we consider this edge insignificant.

Neither the correlation-coefficient method nor our proposed method shares a significant number of edges with the network derived from the reported gene–gene associations in the IPA [27]. Furthermore, the edges common to the GRNs that our method and the IPA-based method estimate have very weak weights according to our method. Hence, we consider these edges to be insignificant as well.

We thus focus on the 321 edges that appear in the GRNs that both our proposed and the correlation-coefficient-based methods estimate. We use the weight values of these network edges, which our method estimates, as metrics for measuring their significance. For simplicity, we eliminate the edges with absolute estimated weights of less than 10% of the maximum absolute weight value of the common 321 edges, and obtain a consensus network of 118 edges. In Fig. 6, we present the consensus GRN, which resembles a directed graph, using these 118 edges. Here, a positive edge weight (denoted by red) signifies stimulation from one gene to the other, and a negative value (denoted by blue) signifies repression. The network in Panel A contains genes that are downregulated, i.e., their RNA abundances are decreasing, and the network in Panel B contains genes that are upregulated, i.e., their RNA abundances are increasing. Here, note that upregulation and downregulation are classified using the correlation-coefficient-based method.
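The relative-weight threshold described above can be sketched as a filter on a dictionary of edge weights. The gene labels `g1` to `g3`, the weights, and the function name are hypothetical and serve only to show the rule; they are not values from the VAP analysis.

```python
def threshold_edges(weights, fraction):
    """Keep edges whose |weight| is at least `fraction` of the largest |weight|."""
    cutoff = fraction * max(abs(w) for w in weights.values())
    return {edge: w for edge, w in weights.items() if abs(w) >= cutoff}


# Hypothetical common edges (source, target) with estimated weights.
weights = {
    ("g1", "g2"): 2.0,     # strong stimulation
    ("g2", "g3"): -0.15,   # weak repression, below 10% of the maximum
    ("g3", "g1"): -1.0,    # repression
    ("g1", "g3"): 0.25,    # weak stimulation, above 10% of the maximum
}
consensus = threshold_edges(weights, fraction=0.10)
strongest = threshold_edges(weights, fraction=0.90)
```

The sign of each retained weight is preserved, so the stimulation or repression label of an edge carries over to the reduced network.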

Next, we briefly summarize the consensus GRN analysis scheme, which is also represented in Fig. 7.

1) Select the significant genes of the microarray dataset of interest using EDGE.

2) First, estimate the GRN using our proposed method. Recall that the proposed method assumes the network is sparse, and estimates the network edges with nonzero weights. Moreover, our method estimates the directions of the estimated edges and classifies whether the connections signify stimulation or repression.

3) Next, estimate the GRN using the correlation-coefficient-based method proposed in [19]. The correlation-coefficient-based method generates an undirected network, but classifies whether the network is upregulatory or downregulatory [19].

4) Then, obtain a consensus network using the common edges of the GRNs that our and the correlation-coefficient-based methods estimate. This consensus network is sparse, directed, and classified as upregulatory or downregulatory. The edges of this network are classified as stimulating or repressing from one gene to the other, and are weighted to specify the degree of stimulation or repression. One can apply a threshold to the edge weights to reduce this consensus network, but that is not necessary.

5) Use IPA to define the molecular and functional classifications of the consensus network and generate hypotheses.
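Step 4 of the scheme above can be sketched as a set intersection. How a directed edge is matched against an undirected one is not spelled out in this section; the sketch assumes that a directed edge is retained whenever its gene pair appears in the undirected network, regardless of orientation. The function name, gene labels, and weights are hypothetical.

```python
def consensus_network(directed_weights, undirected_edges):
    """Keep nonzero directed edges whose gene pair is also an undirected edge.

    Matching ignores orientation (an assumption of this sketch), while the
    direction and signed weight of each kept edge come from the directed network.
    """
    pairs = {frozenset(edge) for edge in undirected_edges}
    return {
        (src, dst): w
        for (src, dst), w in directed_weights.items()
        if w != 0.0 and frozenset((src, dst)) in pairs
    }


# Hypothetical directed network (e.g., from a sparse regression).
directed = {
    ("g1", "g2"): 0.7,
    ("g2", "g1"): 0.0,    # zero weight: no edge
    ("g3", "g1"): -0.4,
    ("g2", "g3"): 0.2,    # not supported by the undirected network
}
# Hypothetical undirected network (e.g., from correlation coefficients).
undirected = [("g2", "g1"), ("g1", "g3")]
common = consensus_network(directed, undirected)
```

The result inherits direction and sign from the directed network, and a threshold on the absolute weights can be applied afterwards if a smaller network is desired.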

The lack of correlation between the consensus network and the IPA-based network suggests a fundamental concern that cannot be resolved from the current data, given our inability to validate either approach. Note that the IPA-based method suffers from its inability to restrict gene–gene connections based upon species or cell phenotype; it is also unable to connect genes of


Fig. 7. Block diagram of the consensus GRN analysis scheme.

unknown functions, which is, however, one of the strengths of the other two methods.

2) Analysis–2: To gain biological insights, we eliminate the network edges with absolute estimated weights of less than 90% of the maximum absolute weight of the common 321 edges, which our proposed and the correlation-coefficient-based methods estimate. Thus, we obtain four subnetworks with the strongest connection weights, as shown in Fig. 8. They contain several interesting interactions. Two of them are upregulatory (networks A and B), and the other two are downregulatory.

3) Network A: This is the smallest network. It exhibits a two-gene interaction, performs cell attraction [pro-platelet basic protein (PPBP), chemotaxis], and negatively controls cell growth and differentiation (PPP2R5A) [27]. This network contains PPBP as the most important gene, which is a growth factor that is a potent chemoattractant and activator of neutrophils [27], [30]. This gene is also known as B cell translocation gene (BTG), and a decrease of the concentration ratio between BTG and PF4 is a sensitive indicator of in vitro platelet activation [27], [30].

4) Network B: This network exhibits interaction between three genes: inhibin beta A (INHBA), cathelicidin antimicrobial peptide (CAMP), and Fms-related tyrosine kinase 1 (FLT1). Here, the central node mRNA is INHBA, which is postulated to be a growth/differentiation factor. It has a beta A subunit mRNA, which is identical to the erythroid differentiation factor subunit mRNA [27], [30]. Note that INHBA is heavily studied in cancer research [27], [30], and has only one gene that exists in the human genome [27], [30]. The second gene, CAMP, is an antimicrobial protein found in granulocytes [27]. It is reported as one of the strongest nodes by McDunn et al. [26] in their GRN analyses performed for the current 85 human mRNA VAP dataset and for a 219 mouse mRNA dataset. The last gene, FLT1, is a protein kinase that is important for the control of cell growth/differentiation [27].

5) Network C: This network exhibits a three-gene interaction. The central node here is the chemokine receptor (CCR1), which is the only gene node common to all three network analyses considered for the VAP dataset, as shown in Fig. 6. In this network, the mRNA LIMK2 mediates interactions for the actin cytoskeleton, which is reminiscent of ACTR3 in network D [27]. Also, network C provides an interaction (albeit indirect) between the actin cytoskeleton (LIMK2) and inhibition of apoptosis (BIRC1) [27].

6) Network D: This is the largest network with eight genes. Here, the central node is the actin-related protein 3 (ACTR3), whose function is largely unknown, but which is related to nucleotide binding and cell motility [27]. Actin polymerization of ACTR3 is mediated by the Arp2/3 complex, which has been found to play a critical role in some pathogens’ intracellular motility [27], [30]. This gene inhibits the HIV-1’s ability to activate the Arp2/3 complex, and could be a potential chemotherapeutic intervention for acquired immunodeficiency syndrome (AIDS) [27], [30]. It also performs actin polymerization of the host cell as necessary to combat parasite infection [27], [30]. The other genes in network D perform diverse cell functions, including inhibition of apoptosis, cell interaction and differentiation, and ion transport [27], [30].

Fig. 8. Subnetworks formed using the most powerful stimulating and repressing connections. Red and blue lines signify stimulation and repression, respectively, from one gene to the other in the connecting edges.

7) Discussion: To summarize, we generate the following experimental tasks for future research using conventional molecular techniques.

1) The interaction between CCR1 and IL-1beta is the only overlap between all three network analyses considered for the VAP dataset. This interaction is already well documented in the literature [27], [30]; however, we estimate this connection to be very weak using our proposed method. Thus, it would be of interest to validate the importance of this interaction, particularly for optimized host responses to bacterial pathogens.

2) The gene ACTR3 has many connections with the other genes; see Fig. 6. The function of this gene is largely unknown in the literature, especially with respect to host–pathogen interactions [27]. We will investigate this further in our future work.

3) Networks C and D both have similar edges, implying actin and apoptosis interaction, which is important to the host response to pneumonia; see Fig. 8. This interaction is defined as actin-mediated apoptosis in the literature [30]. Note that there is a close association between the actin organization and mitochondrial function, namely, the actin cytoskeleton has been implicated in the movement of mitochondria along the actin filaments via an ARP2/3-dependent propulsion mechanism [31]. Furthermore, a link between the actin-regulatory protein gelsolin and mitochondrially induced apoptosis has also been described in mammalian cells [32], [33]. A reduction in the actin dynamics caused by mutations in the actin itself leads to the development of apoptotic phenotypes, such as a loss of mitochondrial membrane potential, elevated reactive oxygen species levels, and DNA fragmentation [34]. Our current computational finding supports these studies for the human genome VAP dataset; however, neither the correlation-coefficient-based method nor our proposed method estimates any connection between LIMK2 and ACTR3, suggesting the need for further experimental verification in our future work.

C. Analysis Using the 50 BATS Selected Genes

We compare our proposed method with a mutual-information-based method in estimating the GRN [21] using the 50 BATS selected genes. The software BATS is developed to select the differentially expressed genes from their time-series measurement [28]. Our method estimates the GRN from time-series microarray datasets, and thus, the resulting estimated GRN from the BATS selected genes should be important. On the other hand, the mutual-information-based method for estimating the GRN is new and well established in the community [21]. Thus, it is interesting to compare the directed edges of the GRN that is estimated using the mutual-information-based method with those estimated using our method. Here, we treat the GRN estimated using the mutual-information-based method as the ground-truth GRN.

To compute the GRN using the mutual-information-based method, we employ the time-series microarray dataset of patients 1, 2, 3, 5, 6, 7, 9, and 10, at data collection days −3, −1, 1, and 3, with day zero being the day that a patient was diagnosed as having VAP; see Section V. We use the data from these collection days as they are common among most of the patients. We then average these data over the patients and use the average dataset to compute the GRN using the mutual-information-based method [21]. To compute the GRN using our proposed method, we employ all the available time-series microarray expression profiles from all the patients; see Section V and Table III.
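The averaging over patients described above can be sketched in Python as follows. This is an illustrative sketch only: the data layout (a dictionary from patient to collection day to a list of expression values), the function name, and all numbers are hypothetical, and skipping a patient who lacks a given day is an assumption suggested by the remark that the chosen days are common among most, not all, of the patients.

```python
def average_over_patients(profiles, days):
    """Average each gene's expression over patients, separately for each day.

    profiles: dict mapping patient id -> {day: [expression value per gene]}.
    days:     collection days to keep (for example -3, -1, 1, 3).
    """
    averaged = {}
    for day in days:
        # Assumption: a patient without a measurement on this day is skipped.
        rows = [profiles[p][day] for p in profiles if day in profiles[p]]
        if not rows:
            continue
        n_genes = len(rows[0])
        averaged[day] = [sum(r[g] for r in rows) / len(rows) for g in range(n_genes)]
    return averaged


# Hypothetical values: two patients, two genes, four collection days.
profiles = {
    1: {-3: [1.0, 2.0], -1: [2.0, 4.0], 1: [3.0, 6.0], 3: [4.0, 8.0]},
    2: {-3: [3.0, 4.0], -1: [4.0, 6.0], 1: [5.0, 8.0]},  # day 3 missing
}
avg = average_over_patients(profiles, days=[-3, -1, 1, 3])
print(avg)
```

The averaged time series (one vector of gene expressions per day) is what would then be passed to the network estimation method.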

We obtain PPV = 0.07 using our method, compared with PPV = 0.05 using the random algorithm, and obtain Se = 0.17 using our method. Recall from Section III that if a method results in a higher value of the PPV than is obtained using the random algorithm, then that method is significant. We thus conclude that our method performs significantly with respect to the mutual-information-based method.
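For readers who want to reproduce this kind of comparison, the following Python sketch computes the positive predictive value (PPV) and sensitivity (Se) of an estimated set of directed edges against a reference set. It assumes the standard definitions PPV = TP/(TP + FP) and Se = TP/(TP + FN); the definitions in Section III remain authoritative. The random-guess baseline shown here (true edges divided by all possible directed edges without self-loops) is likewise an assumption, and the gene names are placeholders.

```python
def ppv_and_sensitivity(estimated, truth):
    """Return (PPV, Se) for two sets of directed edges given as (source, target) pairs."""
    tp = len(estimated & truth)   # edges found that are in the reference
    fp = len(estimated - truth)   # edges found that are not in the reference
    fn = len(truth - estimated)   # reference edges that were missed
    ppv = tp / (tp + fp) if tp + fp else 0.0
    se = tp / (tp + fn) if tp + fn else 0.0
    return ppv, se


def random_ppv(truth, n_genes):
    # Assumption: a uniformly random guess over all directed edges without
    # self-loops has expected PPV = (# true edges) / (n_genes * (n_genes - 1)).
    return len(truth) / (n_genes * (n_genes - 1))


# Hypothetical four-gene example.
truth = {("a", "b"), ("b", "c"), ("c", "d")}
estimated = {("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")}

ppv, se = ppv_and_sensitivity(estimated, truth)
baseline = random_ppv(truth, n_genes=4)
print(ppv, se, baseline)
```

Under these assumptions, an estimate counts as better than chance when its PPV exceeds the baseline, which mirrors the comparison of 0.07 against 0.05 reported above.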

VI. CONCLUSION

In this paper, we proposed a new GRN estimation method derived from time-series gene-expression datasets. We described the time-series gene-expression data using the biophysical model suggested in [6], and proposed a related biostatistical model. Motivated by past studies indicating that most biological networks are typically sparse [2]–[5], we assumed that the GRN is sparse, and estimated it using a Bayesian linear regression method for sparse parameter vectors [7].

We compared the performance of our proposed method with NIR (TSNI) and BANJO, the two most well-established methods in the literature for estimating the GRN from time-series gene-expression data. For this comparison, we employed two time-series gene-expression datasets, recently published in [15], collected for assessing performances of the existing GRN estimation methods. The proposed method outperformed the methods NIR (TSNI) and BANJO in one dataset, and obtained a performance competitive with them in the other. We further evaluated the performance of our method using the in silico time-series datasets consisting of 50 and 100 genes, reported recently in the DREAM3 challenge [23], [24], and our method performs significantly with respect to their ground-truth GRNs.

We also estimated the GRN for a VAP dataset published in [26] using our proposed method, and compared the result with the GRNs that McDunn et al. [26] estimated using correlation-coefficient-based [19] and IPA-based [27] methods. We found that the GRNs estimated using our and the correlation-coefficient-based methods display a large number of common edges. These two methods are unbiased and independent of each other, and hence, their large overlap is interesting. The resulting consensus network is sparse, directed, and classified as upregulatory or downregulatory. The edges of this network are classified as stimulating or repressing, and their estimated weights using our method specify their degree of stimulation or repression. We then analyzed the consensus network using IPA to define molecular and functional classifications and generate hypotheses, and obtained four biologically meaningful networks. Using the top-ranked BATS [28] selected genes of the VAP dataset, we further found that our method performs significantly with respect to the GRN estimated using a mutual-information-based method [21], and thus reconfirmed the validity of our method.

In our future experimental work, we will use conventional molecular analysis to analyze the four biologically meaningful subnetworks obtained by comparing the GRNs estimated from the VAP dataset using our method, the correlation-coefficient-based method, and the IPA-based method.

ACKNOWLEDGMENT

The authors would like to acknowledge the Systems Analysis Group, Washington University, St. Louis, for providing them valuable feedback to improve this paper. They also would like to thank the anonymous reviewers for their constructive comments.

REFERENCES

[1] J. L. DeRisi, V. R. Iyer, and P. O. Brown, “Exploring the metabolic and genetic control of gene expression on a genomic scale,” Science, vol. 278, pp. 680–686, 1997.

[2] M. I. Arnone and E. H. Davidson, “The hardwiring of development: Organization and function of genomic regulatory systems,” Development, vol. 124, pp. 1851–1864, 1997.

[3] D. Thieffry, A. Huerta, E. Perez-Rueda, and J. Collado-Vides, “From specific gene regulation to genomic networks: A global analysis of transcriptional regulation in Escherichia coli,” BioEssays, vol. 20, pp. 433–440, 1998.

[4] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabasi, “The large-scale organization of metabolic networks,” Nature, vol. 407, pp. 651–654, 2000.

[5] H. Jeong, S. Mason, A.-L. Barabasi, and Z. N. Oltvai, “Lethality and centrality in protein networks,” Nature, vol. 411, pp. 41–42, 2001.

[6] D. C. Weaver, C. T. Workman, and G. D. Stormo, “Modeling regulatory networks with weight matrices,” Pac. Symp. Biocomput., vol. 4, pp. 112–123, 1999.

[7] E. G. Larsson and Y. Selen, “Linear regression with a sparse parameter vector,” IEEE Trans. Signal Process., vol. 55, no. 2, pp. 451–460, Feb. 2007.

[8] M. Wahde and J. Hertz, “Coarse-grained reverse engineering of genetic regulatory networks,” BioSystems, vol. 55, pp. 129–136, 2000.

[9] P. Dhaeseleer, S. Liang, and R. Somogyi, “Genetic network inference: From co-expression clustering to reverse engineering,” Bioinformatics, vol. 16, pp. 707–726, 2000.

[10] A. Hartemink, D. Gifford, T. Jaakkola, and R. Young, “Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks,” Pac. Symp. Biocomput., vol. 6, pp. 422–433, 2001.

[11] B. Vogelstein, D. Lane, and A. J. Levine, “Surfing the p53 network,” Nature, vol. 408, pp. 307–310, 2000.

[12] E. M. Marcotte, “The path not taken,” Nat. Biotechnol., vol. 19, pp. 626–627, 2001.

[13] M. K. Stephen Yeung, J. Tegner, and J. J. Collins, “Reverse engineering gene networks using singular value decomposition and robust regression,” in Proc. Nat. Acad. Sci. USA, 2002, vol. 99, pp. 6163–6168.

[14] D. di Bernardo, M. J. Thompson, T. S. Gardner, S. E. Chobot, E. L. Eastwood, A. P. Wojtovich, S. J. Elliott, S. E. Schaus, and J. J. Collins, “Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks,” Nat. Biotechnol., vol. 23, pp. 377–383, 2005.

[15] I. Cantone, L. Marucci, F. Iorio, M. Aurelia Ricci, V. Belcastro, M. Bansal, S. Santini, M. di Bernardo, D. di Bernardo, and M. P. Cosma, “A yeast synthetic network for in vivo assessment of reverse-engineering and modeling approaches,” Cell, vol. 137, pp. 172–181, 2009.

[16] G. Della Gatta, M. Bansal, A. Ambesi-Impiombato, D. Antonini, C. Missero, and D. di Bernardo, “Direct targets of the TRP63 transcription factor revealed by a combination of gene expression profiling and reverse engineering,” Genome Res., vol. 18, pp. 939–948, 2008.

[17] T. S. Gardner, D. di Bernardo, D. Lorenz, and J. J. Collins, “Inferring genetic networks and identifying compound mode of action via expression profiling,” Science, vol. 301, pp. 102–105, 2003.

[18] J. Yu, V. A. Smith, P. P. Wang, A. J. Hartemink, and E. D. Jarvis, “Advances to Bayesian network inference for generating causal networks from observational biological data,” Bioinformatics, vol. 20, pp. 3594–3603, 2004.

[19] J. Ruan and W. Zhang, “Identification and evaluation of functional modules in gene co-expression networks,” in Proc. RECOMB Satellite Conf. Syst. Biol. Comput. Proteomics, 2006, pp. 57–76.

[20] M. Bansal, V. Belcastro, A. Ambesi-Impiombato, and D. di Bernardo, “How to infer gene networks from expression profiles?” Mol. Syst. Biol., vol. 3, p. 78, 2007.

[21] K. Basso, A. A. Margolin, G. Stolovitzky, U. Klein, R. Dalla-Favera, and A. Califano, “Reverse engineering of regulatory networks in human B cells,” Nat. Genet., vol. 37, pp. 382–390, 2005.

[22] G. Stolovitzky, D. Monroe, and A. Califano, “Dialogue on reverse engineering assessment and methods: The DREAM of high-throughput pathway inference,” Ann. NY Acad. Sci., vol. 1115, pp. 1–22, 2007.

[23] (2010). [Online]. Available: http://wiki.c2b2.columbia.edu/dream/index.php/D3c4full

[24] D. Marbach, T. Schaffter, C. Mattiussi, and D. Floreano, “Generating realistic in silico gene networks for performance assessment of reverse engineering methods,” J. Comput. Biol., vol. 16, pp. 229–239, 2009.

[25] (2010). [Online]. Available: http://www.edge-software.com/

[26] J. E. McDunn, K. D. Husain, A. D. Polpitiya, A. Burykin, J. Ruan, Q. Li, W. Schierding, N. Lin, D. Dixon, W. Zhang, C. M. Coopersmith, W. M. Dunne, M. Colonna, B. Ghosh, and J. P. Cobb, “Plasticity of the systematic inflammatory response to acute infection during critical illness,” PLoS One, vol. 3, p. e1564, 2008.

[27] (2010). [Online]. Available: http://www.ingenuity.com/products/pathways_analysis.html

[28] C. Angelini, L. Cutillo, D. De Canditiis, M. Mutarelli, and M. Pensky, “BATS: A Bayesian user-friendly software for analyzing time series microarray experiments,” BMC Bioinf., vol. 9, p. 415, 2008.

[29] N. Safdar, C. Dezfulian, H. R. Collard, and S. Saint, “Clinical and economic consequences of ventilator-associated pneumonia: A systematic review,” Crit. Care Med., vol. 33, pp. 2184–2193, 2005.

[30] (2010). [Online]. Available: http://www.ncbi.nlm.nih.gov/sites/entrez

[31] I. R. Boldogh, H. Yang, W. D. Nowakowski, S. L. Karmon, L. G. Hays, J. R. Yates, and L. A. Pon, “Arp2/3 complex and actin dynamics are required for actin-based mitochondrial motility in yeast,” in Proc. Nat. Acad. Sci. USA, 2001, vol. 98, pp. 3162–3167.

[32] S. Ohtsu, S. Izumi, S. Iwanaga, N. Ohno, and T. Yadomae, “Analysis of mitogenic substances in Bupleurum chinenese by ESR spectroscopy,” Biol. Pharm. Bull., vol. 20, pp. 97–100, 1997.

[33] D. Koya, M. Haneda, H. Nakagawa, K. Isshiki, and H. Sato, “Amelioration of accelerated diabetic mesangial expansion by treatment with a PKC β inhibitor in diabetic db/db mice, a rodent model for type 2 diabetes,” FASEB J., vol. 14, pp. 439–447, 2000.

[34] C. W. Gourlay, L. N. Carpp, P. Timpson, S. J. Winder, and K. R. Ayscough, “A role for the actin cytoskeleton in cell death and aging in yeast,” J. Cell Biol., vol. 164, pp. 803–809, 2004.

Pinaki Sarder (S’03) received the B.Tech. degree in electrical engineering from the Indian Institute of Technology, Kanpur, India, in 2003. He is currently working toward the Ph.D. degree in the Department of Electrical and System Engineering, Washington University, St. Louis (WUSTL), MO.

He is also a Research Assistant in the Department of Electrical and System Engineering, Washington University. His research interests include statistical signal processing, biomedical imaging, and genomics.

Mr. Sarder was a recipient of the Imaging Sciences Pathway Program Fellowship for Graduate Students at WUSTL from January 2007 to December 2007.

William Schierding, photograph and biography not available at the time of publication.

J. Perren Cobb graduated (cum laude) with the M.D. degree from the University of Louisville School of Medicine, Louisville, KY.

He was a trainee in general surgery at the University of California, San Francisco. He completed fellowships in critical care at the National Institutes of Health and the University of Pittsburgh. His investigative work has been supported by the National Institutes of Health, the American Association for the Surgery of Trauma, the Society of Critical Care Medicine, and the Barnes-Jewish Hospital Foundation. He is currently the Director of the Center for Critical Illness and Health Engineering, Washington University School of Medicine, St. Louis, MO, where he is also a Professor of surgery and anesthesiology, and an Associate Professor of genetics. His clinical specialty is surgical critical care with a research interest in the pathophysiology of sepsis and injury. His current grants focus on the application of functional genomics to critical illness and injury, and the development of novel sepsis diagnostics.

Prof. Cobb is the Chair of the Steering Committee for the U.S. Critical Illness and Injury Trials Group. He was the President of the Association for Academic Surgery. He was the recipient of the Research Scholarship Award of the American Association for the Surgery of Trauma, the Founders Grant for Training in Critical Care Research of the Society of Critical Care Medicine, the George H. A. Clowes, Jr., Memorial Research Career Development Award of the American College of Surgeons, and the 2nd Annual Critical Care Medicine Distinguished Alumnus of the University of Pittsburgh. In 2005, Genome Technology Magazine named him a Microarray Innovator of the Year.

Arye Nehorai (S’80–M’83–SM’90–F’94) received the B.Sc. and M.Sc. degrees in electrical engineering from the Technion, Haifa, Israel, and the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA.

From 1985 to 1995, he was a faculty member in the Department of Electrical Engineering, Yale University. In 1995, he joined the Department of Electrical Engineering and Computer Science, University of Illinois, Chicago, as a Full Professor, where from 2000 to 2001, he was the Chair of the department’s Electrical and Computer Engineering (ECE) Division. In 2001, he was named the University Scholar of the University of Illinois. In 2006, he became the Chair of the Department of Electrical and Systems Engineering, Washington University, St. Louis, MO, where he has been the inaugural holder of the Eugene and Martha Lohman Professorship, and the Director of the Center for Sensor Signal and Information Processing since 2006.

Dr. Nehorai has been a Fellow of the Royal Statistical Society since 1996. He was the Editor-in-Chief of the IEEE TRANSACTIONS ON SIGNAL PROCESSING during 2000–2002. From 2003 to 2005, he was the Vice President (Publications) of the IEEE Signal Processing Society (SPS), the Chair of the Publications Board, a member of the Board of Governors, and a member of the Executive Committee of this Society. From 2003 to 2006, he was the Founding Editor of the special columns on Leadership Reflections in the IEEE SIGNAL PROCESSING MAGAZINE. He was the corecipient of the IEEE SPS 1989 Senior Award for Best Paper with P. Stoica, the coauthor of the 2003 Young Author Best Paper Award, and the corecipient of the 2004 Magazine Paper Award with A. Dogandzic. He was the Distinguished Lecturer of the IEEE SPS for the term 2004 to 2005. He was the recipient of the 2006 IEEE SPS Technical Achievement Award and the 2010 IEEE SPS Meritorious Service Award. He is the Principal Investigator of the Multidisciplinary University Research Initiative project entitled Adaptive Waveform Diversity for Full Spectral Dominance.
