From ‘differential expression’ to ‘differential networking’ – identification of dysfunctional regulatory networks in diseases
Alberto de la Fuente
CRS4 Bioinformatica, Polaris Edificio 3, Localita Piscina Manna, 09010 Pula (CA), Italy
Review
Glossary
Differential coexpression: the observation that the correlation (or other
measure of association) between the expression levels of two (or more) genes
is significantly different (higher or lower) in case (e.g. disease) and control (e.g.
healthy) samples.
Differential expression: the observation that the mean expression level of a
given gene (or set of genes) is significantly different (higher or lower) between
case and control samples.
False discovery rate (FDR): expected ratio of false positive discoveries over all
discoveries. For example, if the FDR is estimated to be 0.1 at a given statistical
threshold, then 10% of the discoveries can be expected to be erroneous.
Gene coexpression network: a network model in which the nodes are gene
activities and the edges represent significant associations between them.
Gene network: an abstract model of gene regulation in which the nodes are
gene activities and the edges represent causal influences among the genes
Understanding diseases requires identifying the differences between healthy and affected tissues. Gene expression data have revolutionized the study of diseases by making it possible to simultaneously consider thousands of genes. The identification of disease-associated genes requires studying the genes in the context of the regulatory systems they are involved in. A major goal is to identify specific regulatory networks that are dysfunctional in a given disease state. Although we still have not reached a stage where the elucidation of differential regulatory networks is commonly feasible, recent advances have described the first steps towards this goal: the identification of differential coexpression networks. This review describes the shift from differential gene expression to differential networking and outlines how this shift will affect the study of the genetic basis of disease.
Dysfunctional networks in disease
To understand the roles of genes in complex human diseases, genes need to be studied in the context of the regulatory systems they are involved in [1]. Regulatory systems inside cells can be effectively abstracted into networks (Box 1). Such regulatory networks hold the potential to provide the cellular context of all genes of interest and give a means to identify specific subnetworks that are dysfunctional in a given disease state. It is therefore not surprising that several recent publications explicitly consider gene expression data in the context of biomolecular networks. In particular, a wide variety of ways to identify protein interaction subnetworks containing many differentially expressed genes in diseases have been proposed [2]. Other studies went beyond differential mean expression and focused on differential coexpression patterns in diseases [3–12]. The idea behind these approaches is that the identification of changes in gene coexpression patterns between disease and healthy samples provides information about disease-affected regulatory networks.
Here I review recent literature pioneering the study of dysfunctional regulatory networks with a focus on methodologies used to identify differential coexpression patterns in disease gene expression studies. It should be mentioned that differential correlations have also been used with metabolomics data to identify condition-specific alterations in metabolic pathways [13–17]. Indeed, differential coexpression approaches could be equally applied to disease studies involving metabolomics and proteomics data. Moreover, these approaches are not limited to disease studies and could be used for elucidating cell- and tissue-specific regulatory networks and changes associated with, for example, aging [18], or applied to other case–control settings. Without going into technical details, I will outline the main directions in which these efforts have been pursued. The differential networking methodology requires bringing together the forces of two common approaches to the analysis of gene expression data: differential expression studies and network inference. These two approaches to gene expression analysis will be quickly reviewed to provide a necessary background. Recent approaches for differential coexpression analysis will be discussed and additional routes towards identifying differential regulatory networks in disease will be suggested.
Corresponding author: de la Fuente, A. ([email protected]).
0168-9525/$ – see front matter © 2010 Elsevier Ltd.
Differential expression studies
Gene expression studies of disease are typically performed by comparing gene expression levels between diseased and healthy tissues. This is usually done by testing the statistical significance of the changes in the mean level of expression of each individual gene [19]. To consider genes
(directed) and dependencies due to hidden (unobserved) confounding factors
(undirected).
All rights reserved. doi:10.1016/j.tig.2010.05.001 Trends in Genetics 26 (2010) 326–333
Box 1. Networks in a nutshell
Biomolecular regulatory systems consist of thousands of molecular species of different chemical nature. These systems have been described as networks, such as metabolic networks, protein-interaction networks and transcriptional regulatory networks [37]. In these networks the nodes represent biomolecular species (e.g. metabolites, proteins, RNAs) and the edges represent functional, causal or physical interactions between the nodes (Figure I).
The degree of connectivity k of a node is simply the number of edges attached to it (or the sum over the edge weights, which gives the ‘weighted degree’ of a node [23]). The degree distribution of a network provides the probability P(k) that a node has degree k. For many biological and technological networks it has been observed that the logarithm of P(k) decreases approximately linearly with the logarithm of k; in other words, P(k) approximately follows a power law. Such networks were dubbed scale-free and contain many nodes with very few connections and a small number of hubs with many connections [37,70]. The distance between a pair of nodes refers to the minimum number of edges that needs to be crossed to go from one node to the other. The clustering coefficient of node i is defined as the ratio of the number of edges between nodes connected to i over the total possible number of edges between them. It quantifies how close the neighborhood of node i is to a clique (a subnetwork in which each node is connected to all others). Many biological and technological networks have high average clustering but small average distances between nodes. Such networks are called small-world networks [71].
The abstract representation of biomolecular regulatory systems as networks is fruitful because it provides the ability to study the systems as a whole while ignoring many irrelevant details [37,72]. All chemistry and physics is removed in order to concentrate on the system of interactions. As for all abstractions of natural systems, we are doomed to lose some information when we represent biomolecular regulatory systems as networks [72–74].
Figure I. An example of a network with nodes (black circles) and edges. Edges can be directed (black arrows), indicating an effect running from the source node to the target node, or undirected (red edges), indicating symmetrical relationships. A network can have only undirected edges (undirected networks), only directed edges (directed networks), or both (mixed networks). Edges could be weighted to reflect the strength of the relationship (weighted networks).
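The degree and clustering coefficient defined in Box 1 can be illustrated with a minimal Python sketch. The toy network, the function names and the convention of returning 0 for nodes with fewer than two neighbours are assumptions made for illustration only; they do not come from the review.

```python
from itertools import combinations

def degree(adj, node):
    """Number of edges attached to `node` in an undirected network."""
    return len(adj[node])

def clustering_coefficient(adj, node):
    """Edges present among the neighbours of `node`, divided by the
    number of edges that could exist among them."""
    neighbours = sorted(adj[node])
    k = len(neighbours)
    if k < 2:
        return 0.0  # assumed convention: no neighbour pairs, report 0
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return links / (k * (k - 1) / 2)

# Hypothetical toy network: a triangle A-B-C plus a node D attached to C.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

degree_C = degree(adj, "C")                      # C touches A, B and D
clustering_C = clustering_coefficient(adj, "C")  # only A-B exists among 3 possible pairs
clustering_A = clustering_coefficient(adj, "A")  # B and C are connected: a clique
```

In this toy network node C has the highest degree but a low clustering coefficient, because its neighbour D is not connected to A or B.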
Review Trends in Genetics Vol.26 No.7
in their context, methods have been developed to test for simultaneous mean expression changes in a priori defined gene sets or pathways [20–22] and gene coexpression modules [23,24], as well as to identify differentially expressed subnetworks within protein interaction networks [2]. Genes or pathways whose mean expression levels either rise or fall are generally believed to be associated with the disease phenotype. After performing differential expression tests for each gene or pathway, a threshold level must be established, based on the test statistic (or corresponding P value), to determine which genes and pathways are differentially expressed. Selecting a significance level can be difficult because hundreds or thousands of hypotheses are tested simultaneously. A rejection of the null-hypothesis (i.e. accepting that a gene or pathway is significantly differentially expressed) at P<0.05 can result in many false-positive discoveries. Several solutions to this multiple hypothesis testing problem have been proposed, but the false discovery rate [25,26] control is the most widely used method in gene expression data analysis.
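As an illustration of FDR control, the sketch below implements the Benjamini-Hochberg step-up procedure. This is one common FDR-controlling method, and choosing it here is an assumption, since the text does not say which procedure is meant. The P values are invented for the example.

```python
def benjamini_hochberg(p_values, fdr=0.1):
    """Return the indices of hypotheses rejected by the Benjamini-Hochberg
    step-up procedure at the requested false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) whose ordered P value satisfies
    # p_(k) <= (k / m) * fdr; all hypotheses up to that rank are rejected.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

# Hypothetical P values from ten differential expression tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

naive_hits = [i for i, p in enumerate(p_values) if p < 0.05]
fdr_hits = benjamini_hochberg(p_values, fdr=0.05)
```

With these invented values the naive P<0.05 cut-off selects five genes, whereas the FDR-controlled selection at 0.05 keeps only the two strongest.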
Although differential expression approaches have been very successful, much of the information contained in gene expression datasets is ignored. Known disease genes are often not differentially expressed in diseases because mutations in the coding region can affect the function of the gene without affecting its expression level. Furthermore, a variety of post-translational modifications (e.g. reversible phosphorylation or acylation) can affect regulatory activities of a gene product independently of its expression level. These facts have hampered the identification of disease-related genes from gene expression studies.
Elucidating gene networks
On the other side of the spectrum of gene expression data analysis are the approaches for network inference. Many approaches for inferring gene networks from gene expression data (Box 2) have been proposed and applied to genome-wide expression datasets [27–31] (Box 3). Typically, these methods require much more data than the differential expression tests mentioned above, and the data need to be produced under a controlled experimental setup (e.g. a large number of targeted quantitative genetic perturbations have to be created and the genome-wide gene expression responses measured). The need for such large datasets was recently emphasized by the DREAM (Dialogue for Reverse-Engineering Assessment and Methods) initiative, in a community effort to infer regulatory networks [31–35]. Elucidation of reliable genome-scale gene networks seems outside the scope of current experimental abilities, but perhaps this should not be the goal in the first place. Indeed, it could be argued that obtaining genome-scale networks does not provide much insight into the functioning of specific systems underlying diseases. Instead, a targeted approach to identify only subnetworks that differ between a selected set of phenotypes is a more relevant goal. As a first step towards that goal it is important to identify how relationships among gene activities change between healthy and disease expression samples.
From differential expression to differential coexpressionRecent investigations have gone beyond testing for differ-ential expression and aim to elucidate dysfunctional regu-latory networks in disease. Instead of focusing ondifferences in mean gene expression levels, the goal is toidentify differences in their coexpression patterns (com-
327
Box 3. Inferring gene networks
Gene network inference is the task of identifying the network(s) of
causal regulatory influences between genes that optimally describes
observed gene expression patterns. Inferring causation requires
targeted perturbations and response measurements. This requires
either experimental perturbations (e.g. single gene knockout,
knockdown or over-expression), or natural genetic perturbations
(i.e. simultaneous genotyping and gene expression measurements).
In the latter case, the naturally occurring DNA polymorphisms could
conceptually be seen as systematic perturbations to the regulatory
networks [51]. The basic logic behind gene network inference is
quite simple: when the expression level of gene A is perturbed
(experimentally or by a naturally occurring polymorphism) and
subsequently gene B’s expression level is observed to change then
gene B is causally downstream of gene A. Then, it has to be decided
if the causal effect is direct (not mediated by any set of the other
observed gene activities) or indirect (mediated by some set of the
other observed gene activities). Although simple in concept, the
technical aspects can be quite complicated. There is a high demand
on data because many perturbations are needed to elucidate the
wiring of genome scale gene networks [76]. Such measurements are
typically not available; instead most disease gene expression
studies concern observational data, in other words data collected
over a population of similar individuals without any specified
perturbations (no experimental interventions or genotype data
collected). These data do not generally allow for causal inference
and it is only possible to identify correlations between gene
expression levels.
In correlation networks (also called gene coexpression networks
in this context [75]) pairs of genes are connected by an undirected
edge if their activities (expression levels) behave similarly over a
series of gene expression measurements, usually quantified by
pairwise correlation [27,75]. Gene activities can be correlated due
to different causal relationships including: (i) direct effects, (ii)
indirect effects (transitivity), and (iii) confounding effects (common
regulator). Several algorithms have been proposed to eliminate
edges corresponding to the situations (ii) and (iii) (if the confound-
ing variables are measured) [39–41], resulting in a network with
edges corresponding to direct effects or confounding due to
unobserved variables. Under some assumptions for the network
structure it is theoretically possible to decide the orientation of the
edges [77,78], but unfortunately these assumptions (such as
absence of directed cycles in the network and absence of
confounding factors) are very unlikely to be met in the present
context.
Box 2. What are gene networks?
Gene networks (also called gene regulatory networks [75]) are
abstract models with nodes representing gene activities (gene
expression levels, mRNA concentrations) and edges representing
direct causal influences and correlations between the gene activ-
ities. The direct causal influence A ! B means that the activity of
gene B changes as a consequence of a change in gene activity A and
no other gene activity or set of gene activities mediates the influence
(e.g. in the cascade A! B! C, there is a causal effect of A on C, but
because this is mediated by B there will be no edge drawn from A to
C). A direct causal influence could be due to gene A’s protein
product activating the transcription of gene B upon binding to its
promoter sequence (as in a transcription factor–target relationship,
such as the gene 1 ! gene 2 relation; Figure I), but also to more
complicated processes, such as gene A encoding a metabolic
enzyme producing a metabolite that in turn regulates the transcrip-
tion of gene B. These detailed biochemical events are hidden from
the observed set of variables (gene expression levels) and their
effects will merely result in an observable direct causal effect (such
as the gene 2 ! gene 4 relation; Figure I).
Gene networks are context specific: the regulatory structure
among genes depends on the developmental stage, cell type,
environment, genotype and disease state. For a comprehensive
discussion on the nature of gene networks please refer to Ref. [75].
Figure I. A gene network as the abstract representation of the cellular
biochemistry network. Nodes are metabolites, proteins and gene activities.
Solid arrows depict biochemical processes such as transcriptional regulation,
metabolic conversion and protein association. All detailed biochemical
processes are projected onto causal effects and associations in the gene
activity space, giving rise to the gene network concept. The resulting dashed
arrows represent direct causal regulatory influences between gene activities.
Figure reprinted from Ref. [28]; Trends in Biotechnology, vol. 20, Brazhik, P.
et al., Gene networks, how to put the function in genomics, pp. 467–472,
copyright 2002, with permission from Elsevier.
Review Trends in Genetics Vol.26 No.7
monly quantified by pairwise correlations) in healthy anddisease-affected samples (Box 4). Pairwise relationshipsbetween gene expression levels result from regulatoryrelationships among the genes, and identifying which ofthese are altered in disease-affected tissue as compared tohealthy tissue is a first step in pinpointing dysfunctionalregulatory systems.
The first approaches to test for differential coexpression [3–5] applied to cancer gene expression datasets identified several transcriptional regulators known to be involved in cancer that were highly differentially coexpressed whereas their mean expression levels had hardly changed. This illustrated the relevance of considering coexpression changes in addition to differential mean expression when comparing gene expression datasets. Strong support for this need was recently provided [10]. As a proof-of-principle, a differential coexpression approach was used to compare gene expression data from two varieties of bulls, one with and one without a known mutation in the transcriptional regulator myostatin. Whereas the mean expression of the myostatin gene did not significantly differ between the two varieties, the gene was ranked highest among 920 transcriptional regulators when considering a measure based on differential coexpression [10]. Several other investigations yielded the same conclusion [3,4,6–9,11,12]. It is therefore important to perform differential coexpression tests in addition to the common differential mean expression testing.
Box 4. Differential coexpression
Many definitions of differential coexpression, and many statistical tests for it, have been proposed. What unites these tests is a focus on changes in coexpression patterns between gene expression levels (Figure I).
The association between two gene expression levels can be quantified by the Pearson correlation: $r_{ij} = \mathrm{cov}(x_i, x_j) / \sqrt{\mathrm{var}(x_i)\,\mathrm{var}(x_j)}$. Here, cov() is the covariance and var() the variance of the gene expression levels. The correlation between a pair of gene expression levels is then calculated over the healthy sample, $r^{H}_{ij}$, and over the disease sample, $r^{D}_{ij}$. At this stage one could test two hypotheses, $H_{01}: r^{H}_{ij} = 0$ and $H_{02}: r^{D}_{ij} = 0$. If neither is rejected, then we have the uninteresting scenario where the genes are not correlated in either sample. If both are rejected but the correlations have the same sign, then we have the uninteresting scenario that the genes are similarly correlated in both samples, and could thus not have any significant involvement in the disease. If only one of the hypotheses is rejected, or when both are rejected but correlations have changed signs, then the pair of genes is accepted to be differentially coexpressed. The approaches based on coexpression networks discussed in the main text essentially take this approach to differential coexpression. Testing for non-zero correlations is usually done by the t-test: $t_{ij} = r_{ij} \sqrt{\dfrac{n-2}{1-r_{ij}^{2}}}$, where n is the number of observations in the sample.
Alternatively, one could test directly for differential coexpression by testing the null-hypothesis $H_{0}: r^{H}_{ij} = r^{D}_{ij}$. Several approaches following this line of thought are discussed in the main text. Testing could involve the Z-test $Z_{ij} = \dfrac{|z_{ij1} - z_{ij2}|}{\sqrt{\dfrac{1}{n_1-3} + \dfrac{1}{n_2-3}}}$, where $z_{ij}$ are the Fisher-transforms of the correlations, $z_{ij} = \dfrac{1}{2} \ln \left| \dfrac{1+r_{ij}}{1-r_{ij}} \right|$, and $n_1$ and $n_2$ are the sample sizes of the healthy and disease sample, respectively.
Note that the data requirements for differential coexpression analysis are different from those for differential mean expression analysis. Whereas for the latter often as few as three observations per group are taken, reliably calculating correlations will require at least tens of replicates per group, making the differential coexpression approach only feasible for larger disease studies. Also, care should be taken when selecting techniques to normalize raw expression data, as some methods will introduce bias in the correlation structure of the data [79,80].
Figure I. Example of changed gene expression correlations with unchanged mean expression levels. Each dot corresponds to a healthy (left) or sick (right) subject.
Whereas the correlation in the group of healthy patients is high – genes A and B are tightly coordinated in their expression – this is not the case in the sick group of
subjects.
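The tests in Box 4 can be written out as a short Python sketch using only the standard library. The function names and the example correlations and sample sizes are hypothetical; the two-sided P value assumes that the Z statistic is approximately standard normal under the null hypothesis.

```python
import math

def pearson(x, y):
    """Pearson correlation r = cov(x, y) / sqrt(var(x) * var(y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def t_statistic(r, n):
    """t statistic for the null hypothesis r = 0 (requires |r| < 1)."""
    return r * math.sqrt((n - 2) / (1 - r * r))

def fisher_z(r):
    """Fisher transform of a correlation (requires |r| < 1)."""
    return 0.5 * math.log((1 + r) / (1 - r))

def differential_correlation_z(r1, n1, r2, n2):
    """Z statistic for the null hypothesis that two correlations are equal."""
    return abs(fisher_z(r1) - fisher_z(r2)) / math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))

def two_sided_p(z):
    """Two-sided P value under a standard normal null distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical gene pair: strongly correlated in 30 healthy samples,
# nearly uncorrelated in 30 disease samples.
z_stat = differential_correlation_z(0.8, 30, 0.1, 30)
p_value = two_sided_p(z_stat)
```

Because thousands of gene pairs are typically tested, the resulting P values still need a multiple-testing correction before pairs are called differentially coexpressed.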
Comparing coexpression networks
Studies of differential coexpression typically have involved, either explicitly or implicitly, the construction of coexpression networks for healthy and disease samples. Comparing the structure of the two coexpression networks provides insight into disease-specific alterations in the regulatory systems underlying the correlation patterns. The simplest way to perform such a comparison is to look at the degree (or connectivity) of each gene in the two networks. Genes that have a strongly altered connectivity are thought to play an important role in the disease phenotype [7,36]. The main difficulty of this approach is establishing a threshold for each edge to be included in the network. Ideally, one sets the threshold level such that the resulting networks include as many biologically relevant edges as possible while keeping the number of spurious edges low. The multiple hypothesis testing problem is even more severe than for differential expression testing because for n genes we now have [n × (n − 1)]/2 edges to test! Several authors have resorted to arbitrary, stringent thresholds [4,7], or selected a fixed number of strongest edges [6]. Selecting a very high correlation threshold indeed guarantees the exclusion of many spurious edges, but obviously will also exclude many relevant ones.
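The degree comparison described above can be sketched as follows. The four genes, their correlations and the threshold of 0.7 are invented for illustration, and the choice to threshold the absolute correlation is an assumption.

```python
def thresholded_degrees(corr, genes, threshold):
    """Degree of each gene in the network that keeps an edge
    whenever |r| >= threshold."""
    degree = {g: 0 for g in genes}
    for i, gi in enumerate(genes):
        for gj in genes[i + 1:]:
            if abs(corr[(gi, gj)]) >= threshold:
                degree[gi] += 1
                degree[gj] += 1
    return degree

genes = ["g1", "g2", "g3", "g4"]

# Hypothetical pairwise correlations in healthy and disease samples.
healthy = {("g1", "g2"): 0.90, ("g1", "g3"): 0.85, ("g1", "g4"): 0.80,
           ("g2", "g3"): 0.20, ("g2", "g4"): 0.10, ("g3", "g4"): 0.75}
disease = {("g1", "g2"): 0.30, ("g1", "g3"): 0.10, ("g1", "g4"): 0.82,
           ("g2", "g3"): 0.25, ("g2", "g4"): 0.05, ("g3", "g4"): 0.78}

# Number of candidate edges, i.e. hypotheses to test: n(n - 1)/2.
n_tests = len(genes) * (len(genes) - 1) // 2

deg_healthy = thresholded_degrees(healthy, genes, threshold=0.7)
deg_disease = thresholded_degrees(disease, genes, threshold=0.7)
degree_change = {g: deg_disease[g] - deg_healthy[g] for g in genes}
```

In this toy example g1 loses two of its three edges in the disease network, which would flag it as a candidate. For a realistic study with thousands of genes the number of candidate edges runs into the millions, which is why the choice of threshold matters so much.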
A potentially effective way to select the threshold is to use the global network topology of the inferred coexpression networks to guide the choice [8,23]. Several global network topological characteristics have been observed in protein interaction networks and metabolic networks, such as the power-law degree distribution and high clustering coefficient [37]. It would be plausible to assume that gene networks have similar properties and that these are reflected in coexpression networks as well. However, given the nature of coexpression networks, care should be taken when making such conclusions. Indeed, studies show [4,6,7] that coexpression networks possess power-law degree distributions. Based on these observations, Zhang and Horvath formally proposed a scale-free topology criterion for network construction in which a threshold value is chosen such that the resulting networks are approximately scale free [23]. A recent study showed how one can use the clustering coefficient to guide the selection of this threshold [8]. The motivation behind this approach was that clustering coefficients in the coexpression network should be higher than expected by chance. Starting with a fully connected correlation network, the weakest edges are dropped until a maximum difference between clustering in the network and randomized networks is found. Using simulation studies, thresholds obtained through this approach were shown to consistently outperform statistically motivated thresholds. This is a promising result showing that tools from complex network analysis could be a powerful alternative to statistical approaches to deal with the difficulties associated with multiple hypothesis testing.
Comparing weighted coexpression networks
Arguably a better approach to compare coexpression networks would be to drop the idea of the threshold altogether and consider weighted coexpression networks in which all genes are connected to all other genes, and each edge is weighted to reflect the strength (or the confidence in the existence) of the relationship [23]. Essentially, the analysis of weighted networks is equivalent to the analysis of the weight matrices, such as the correlation matrix. The CoXpress method proposed by Watson [38] is aimed at identifying differentially coexpressed gene groups. The procedure first performs a hierarchical clustering using the correlation matrix obtained from the healthy data (or disease data), then tests if the average correlation among genes in a cluster is higher than expected by chance. Each cluster with a significant average correlation in the healthy data but not in the disease data, or vice versa, is considered to be differentially coexpressed. This approach has been extended recently to allow for using a priori defined gene sets and was applied to a study of mammary gland tumors in mice [12]. Instead of using classical views of pathways (as appearing in textbooks and pathway databases), the authors provided a top-down definition of pathways based on a modularization of the mouse protein interaction network. These network-based pathways were subjected to both differential expression (using Gene Set Enrichment Analysis: GSEA [20]) and differential coexpression over a progression of mouse mammary gland tumors covering three stages: wild type (healthy), hyperplastic (early disease) and tumor (advanced disease). This study highlighted the dynamic interplay between the differential expression and differential coexpression of the pathways. Some pathways were turned off by downregulation (lower mean expression) and decreased coexpression. Others were induced via upregulation and increased coexpression. Furthermore, some pathways showed an increase in mean expression, but a decrease in coexpression, or vice versa. This counter-intuitive result led to an important insight: although commonly interpreted as an indication that a pathway is involved in the disease examined, an increase in the mean intensity of gene expression levels in a pathway accompanied by a decrease in correlations might merely indicate a change in functional assignment of constituent genes because the genes are potentially part of many different pathways. Conversely, a downregulated pathway with increased correlation might indicate that the mean intensity and inter-modular activity is replaced by a higher dedication of the genes to the pathway. Only looking at mean expression changes could lead to incorrect conclusions about the involvement of a pathway in a disease condition [12].
The determination of weighted pairwise relationships can also be done using soft thresholding [23]. Instead of using the raw correlations to obtain the weights, the correlation coefficients are raised to a certain power whose value is selected in order to obtain a network with scale-free weighted degree distributions [23]. For powers higher than 1 this results in downsizing the weaker correlations more drastically than the higher correlations. This weighted coexpression network approach was applied to a study of obesity in mice [9]. Using data from two F2 mouse intercrosses, extreme phenotypes (30 leanest versus 30 heaviest mice) were contrasted to identify differential coexpression involved in the obesity phenotype. For each phenotype a weighted coexpression network was created and genes were compared on the basis of their weighted degree. A set of genes was identified that were increased both in mean expression levels and (weighted) connectivity in the obese mice compared to the lean mice. This set was enriched in EGF and EGF-like factors that have been reported to play a role in the induction of obesity [9].
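A minimal sketch of soft thresholding follows. Taking the absolute value of the correlation before raising it to the power is an assumption (it yields an unsigned network), the power of 6 and the correlations are invented, and the selection of the power by a scale-free fit is not shown.

```python
def soft_threshold_weights(corr, beta):
    """Edge weights |r|**beta; powers above 1 shrink weak correlations
    much more strongly than strong ones."""
    return {pair: abs(r) ** beta for pair, r in corr.items()}

def weighted_degree(weights, genes):
    """Sum of the weights of all edges attached to each gene."""
    degree = {g: 0.0 for g in genes}
    for (gi, gj), w in weights.items():
        degree[gi] += w
        degree[gj] += w
    return degree

genes = ["g1", "g2", "g3"]
# Hypothetical correlations: one strong, one weak, one negative.
corr = {("g1", "g2"): 0.9, ("g1", "g3"): 0.3, ("g2", "g3"): -0.5}

weights = soft_threshold_weights(corr, beta=6)
connectivity = weighted_degree(weights, genes)

# The strong/weak ratio grows from 0.9/0.3 = 3 to (0.9/0.3)**6 = 729.
ratio_after = weights[("g1", "g2")] / weights[("g1", "g3")]
```

No edge is removed, so no hard threshold has to be justified; weak correlations simply contribute almost nothing to a gene's weighted degree.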
Healthy and disease networks could be refined by considering higher order (partial) correlations [39–41] or, equivalently, local dependency networks. This approach was recently pursued to identify differential dependency networks in subclasses of breast cancer [42].
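To show how a partial correlation removes an indirect edge, the sketch below uses the standard first-order partial correlation formula. The cascade and its correlation values are invented; the cited methods [39–41] may use higher orders or different estimators.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y given z:
    (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical cascade A -> B -> C: A and C are correlated (0.64)
# only because both are correlated with B (0.8 each).
r_ac_given_b = partial_correlation(r_xy=0.64, r_xz=0.8, r_yz=0.8)

# Hypothetical direct relation: the A-C correlation (0.7) cannot be
# explained by the weak correlations of A and C with B (0.2 each).
r_direct = partial_correlation(r_xy=0.7, r_xz=0.2, r_yz=0.2)
```

In the cascade the partial correlation of A and C given B is essentially zero, so the A–C edge would be dropped, whereas in the second case the edge survives conditioning on B.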
Direct differential coexpression measures
The drawback of compiling two separate coexpression networks is that it requires separate decisions and thresholds for the healthy and disease networks. For example, in the CoXpress method, a cluster that has significant average correlation in the healthy data (decision 1) but not in the disease data (decision 2) is considered to be differentially coexpressed. Identifying differential coexpression could be made simpler: instead of establishing that the coexpression is significant in one condition and not in the other, one could test directly if the change in coexpression is significant. The early test for differential coexpression by Lai et al. [5] belongs to this category of methods, as does the measure of Hudson et al. [10]. In another early study, Kostka and Spang formulated an approach for selecting sets of genes based on a differential coexpression measure [3]. Recently, Gene Set Coexpression Analysis [43] (GSCA, in analogy to the widely used GSEA for testing differential mean expression of pathways) was proposed to test for pathway differential coexpression by using a measure summarizing the change in coexpression over all pairs of genes inside a given pathway. The benefit of GSCA compared to CoXpress is that the pathways do not have to be enriched in correlations either in the healthy or the disease data for the overall change in correlation to be significant [43]. This enables cases to be captured in which some correlations in a given pathway go up while others go down, reflecting a specific rewiring of the regulatory system. Others have looked at changing correlations between different pathways [44]. Instead of focusing on the coordinated expression of genes within a given pathway, the focus was put on the coordination between genes from distinct pathways.
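The idea of testing a pathway-level change in coexpression directly can be sketched as below. This is not the published GSCA implementation: the summary statistic (root-mean-square change in pairwise correlations), the permutation of sample labels as the null model, and the two-gene toy data are all assumptions chosen for illustration.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def pairwise_correlations(samples):
    """Correlations for all gene pairs; `samples` holds one expression
    profile (a list with one value per pathway gene) per subject."""
    n_genes = len(samples[0])
    cols = [[s[g] for s in samples] for g in range(n_genes)]
    return [pearson(cols[i], cols[j])
            for i in range(n_genes) for j in range(i + 1, n_genes)]

def coexpression_change(healthy, disease):
    """Root-mean-square difference between the pairwise correlations."""
    rh, rd = pairwise_correlations(healthy), pairwise_correlations(disease)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(rh, rd)) / len(rh))

def permutation_p_value(healthy, disease, n_perm=1000, seed=0):
    """Fraction of label permutations with a change at least as large
    as the observed one (with the usual +1 correction)."""
    rng = random.Random(seed)
    observed = coexpression_change(healthy, disease)
    pooled = healthy + disease
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if coexpression_change(pooled[:len(healthy)], pooled[len(healthy):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical two-gene pathway: identical means in both groups, but the
# genes are perfectly coordinated in healthy subjects only.
healthy = [[x, x] for x in range(1, 9)]
disease = [[x, y] for x, y in zip(range(1, 9), [4, 1, 8, 5, 2, 7, 3, 6])]

observed_change = coexpression_change(healthy, disease)
p_value = permutation_p_value(healthy, disease, n_perm=200)
```

Because the statistic is built from squared differences, correlations that go up and correlations that go down both contribute, which is the property the text attributes to testing the overall change rather than enrichment in either condition.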
In addition to investigating changes in mean gene expression levels, and correlations between gene expression levels, one could examine the variance of gene expression distributions in healthy versus disease samples [45,46]. Low gene expression variability might indicate strong homeostatic control. If variances in the disease sample are drastically higher than in the healthy sample, such control might have been lost.
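A variance comparison of this kind can be illustrated with a simple variance ratio. The expression values are invented, and using the plain ratio of sample variances (the statistic behind an F-test) is an assumption; the cited studies [45,46] may use other measures.

```python
import statistics

def variance_ratio(healthy, disease):
    """Ratio of the sample variances (disease over healthy) of one gene."""
    return statistics.variance(disease) / statistics.variance(healthy)

# Hypothetical expression levels of one gene; both groups have mean 5.0.
healthy = [5.0, 5.1, 4.9, 5.0, 5.2, 4.8]
disease = [5.0, 6.5, 3.4, 5.9, 4.1, 5.1]

same_mean = abs(statistics.mean(healthy) - statistics.mean(disease)) < 1e-9
ratio = variance_ratio(healthy, disease)
```

A differential mean expression test would see nothing here, while the variance in the disease group is many times larger than in the healthy group.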
The methods described above allow investigators to gain a deeper understanding of disease-associated changes in gene expression patterns beyond differential expression. By identifying differential coexpression, insights were gained that are simply missed when comparing mean gene expression levels alone. As Mentzen et al. pointed out, only looking at mean expression changes might even lead to incorrect conclusions about the involvement of particular pathways in disease conditions [12]. The shift from differential expression to differential coexpression has already delivered on its first promises and will continue to be beneficial for disease studies in the future.
From differential coexpression to differential networking
Identifying differential coexpression is the first step towards identifying differential gene networks. As Bill Shipley insightfully states in his book Cause and Correlation in Biology [47]: ‘As with shadows, these correlational patterns are incomplete – and potentially ambiguous – projections of the original causal processes. As with shadows, we can infer much about the underlying causal process if we can learn to study their details, to sharpen their contours, and especially if we can study them in context.’
When sets of changed correlations have been identified, the next step is to establish the causal influences in the regulatory systems (i.e. to put directions on the edges in the undirected coexpression network) and, more importantly, to identify which causal influences have disappeared in the disease network with respect to the healthy network. Such disappeared regulatory mechanisms resulted in the observed changes in correlations and potentially could underlie the associated disease phenotype. Although it is not trivial to identify the causal system from the correlation patterns, the changes in correlation hint at the interesting regions of the network involved in disease which could form the basis for further detailed analysis. Systematic perturbations (e.g. experimental gene knockouts) are needed to establish the edges’ direction. Particular promise comes from so-called systems genetics experiments in which genotyping and gene expression data (and possibly metabolomics and proteomics data [48,49]) are simultaneously collected from a population under study. It has been demonstrated that causal links in gene networks can be elucidated based on these data [50–60] (reviewed in Refs [61,62]). Systems genetics datasets have been obtained from a wide variety of organisms and many datasets for human disease studies will be produced in the near future.

Genotyping data are collected at a tremendous rate and it is becoming clearer that profound insights into human disease cannot be obtained by looking at genotypes alone. The complex interplay between thousands of molecular species involved in disease phenotypes must be elucidated to obtain a deeper insight into disease physiology [1]. Systems genetic data could be used to perform a differential coexpression analysis by contrasting two extremes of the disease phenotype (e.g. the healthiest individuals versus the most affected individuals), an approach similar to one previously used in mice [9], followed by an analysis of the whole dataset to establish the causal structures underlying the correlational differences. In addition, dysfunctional transcription factors [63–65] and microRNAs involved in disease phenotypes [12] could be identified by looking at the changing coexpression structure among their experimentally established, and computationally predicted, targets. Finally, increased confidence in changing coexpression patterns could be gained by identifying pattern changes that are shared between different human diseases [6,66] and between humans and animal disease models. Instead of looking at conserved coexpression patterns across species [67–69], the focus would then shift to conserved changes in coexpression patterns in similar diseases across species. The differential networking methodology discussed here will certainly play a strong role in future analysis of the massive amounts of disease genotyping and gene expression data that will soon be generated, and is likely to bring profound insights into the dysfunctional regulatory systems underlying complex human diseases.
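The idea that genotype data can help put a direction on a coexpression edge can be illustrated with a small simulation. The sketch below is only a toy example under strong assumptions (a single hypothetical locus acting on gene A, linear effects, no hidden confounders); it shows the conditional-independence reasoning in general terms and does not reproduce any of the cited systems genetics methods.

```python
import numpy as np


def partial_corr(x, y, z):
    """Correlation between x and y after linearly regressing z out of both."""
    design = np.column_stack([np.ones_like(z), z])
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]


rng = np.random.default_rng(1)
n = 2000  # hypothetical population size

# Simulated ground truth (an assumption of this sketch): locus -> gene A -> gene B.
genotype = rng.integers(0, 3, size=n).astype(float)  # allele count 0, 1 or 2
gene_a = 1.0 * genotype + rng.normal(size=n)
gene_b = 0.8 * gene_a + rng.normal(size=n)

# If A regulates B, the locus carries no information about B once A is known,
# so this partial correlation should be close to zero.
r_lb_given_a = partial_corr(genotype, gene_b, gene_a)

# Conditioning on B instead does not remove the locus-A association,
# which argues against the reverse direction B -> A.
r_la_given_b = partial_corr(genotype, gene_a, gene_b)

print(f"r(L,B|A)={r_lb_given_a:.3f}  r(L,A|B)={r_la_given_b:.3f}")
```

With real data the two partial correlations would have to be tested formally, and unmeasured common causes can produce the same pattern, so such a result is better treated as a hypothesis for follow-up perturbation experiments than as proof of direction.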
Acknowledgements
I kindly thank Paolo Uva, Diogo Camacho, three anonymous reviewers and the editor for critical reading of the manuscript and their insightful suggestions. This work was supported in part by the Regional Authorities of Sardinia (see: http://www.sardegnaricerche.it/).
References
1 Schadt, E.E. (2009) Molecular networks as sensors and drivers of common human diseases. Nature 461, 218–223
2 Ideker, T. and Sharan, R. (2008) Protein networks in disease. Genome Res. 18, 644–652
3 Kostka, D. and Spang, R. (2004) Finding disease specific alterations in the co-expression of genes. Bioinformatics 20 (Suppl. 1), i194–199
4 Carter, S.L. et al. (2004) Gene co-expression network topology provides a framework for molecular characterization of cellular state. Bioinformatics 20, 2242–2250
5 Lai, Y. et al. (2004) A statistical method for identifying differential gene–gene co-expression patterns. Bioinformatics 20, 3146–3155
6 Choi, J.K. et al. (2005) Differential coexpression analysis using microarray data and its application to human cancer. Bioinformatics 21, 4348–4355
7 Reverter, A. et al. (2006) Simultaneous identification of differential gene expression and connectivity in inflammation, adipogenesis and cancer. Bioinformatics 22, 2396–2404
8 Elo, L.L. et al. (2007) Systematic construction of gene coexpression networks with applications to human T helper cell differentiation process. Bioinformatics 23, 2096–2103
9 Fuller, T.F. et al. (2007) Weighted gene coexpression network analysis strategies applied to mouse weight. Mamm. Genome 18, 463–472
10 Hudson, N.J. et al. (2009) A differential wiring analysis of expression data correctly identifies the gene containing the causal mutation. PLoS Comput. Biol. 5, e1000382
11 Hu, R. et al. (2009) Detecting intergene correlation changes in microarray analysis: a new approach to gene selection. BMC Bioinformatics 10, 20
Review Trends in Genetics Vol.26 No.7
12 Mentzen, W.I. et al. (2009) Dissecting the dynamics of dysregulation of cellular processes in mouse mammary gland tumor. BMC Genomics 10, 601
13 Steuer, R. et al. (2003) Observing and interpreting correlations in metabolomic networks. Bioinformatics 19, 1019–1026
14 Martins, A.M. et al. (2004) A systems biology study of two distinct growth phases of Saccharomyces cerevisiae. Curr. Genomics 5, 649–663
15 Weckwerth, W. et al. (2004) Differential metabolic networks unravel the effects of silent plant phenotypes. Proc. Natl. Acad. Sci. U. S. A. 101, 7809–7814
16 Camacho, D. et al. (2005) The origin of correlations in metabolomics data. Metabolomics 1, 53–63
17 Steuer, R. (2006) On the analysis and interpretation of correlations in metabolomic data. Brief. Bioinformatics 7, 151–158
18 Gillis, J. and Pavlidis, P. (2009) A methodology for the analysis of differential coexpression across the human lifespan. BMC Bioinformatics 10, 306
19 Cui, X. and Churchill, G.A. (2003) Statistical tests for differential expression in cDNA microarray experiments. Genome Biol. 4, 210
20 Subramanian, A. et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U. S. A. 102, 15545–15550
21 Dinu, I. et al. (2007) Improving gene set analysis of microarray data by SAM-GS. BMC Bioinformatics 8, 242
22 Ackermann, M. and Strimmer, K. (2009) A general modular framework for gene set enrichment analysis. BMC Bioinformatics 10, 47
23 Zhang, B. and Horvath, S. (2005) A general framework for weighted gene co-expression network analysis. Stat. Appl. Genet. Mol. Biol. 4, Article 17
24 Ghazalpour, A. et al. (2006) Integrating genetic and network analysis to characterize genes related to mouse weight. PLoS Genet. 2, e130
25 Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate – a practical and powerful approach to multiple testing. J. Roy. Stat. Soc. B 57, 289–300
26 Storey, J.D. and Tibshirani, R. (2003) Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. U. S. A. 100, 9440–9445
27 D’Haeseleer, P. et al. (2000) Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics 16, 707–726
28 Brazhnik, P. et al. (2002) Gene networks: how to put the function in genomics. Trends Biotechnol. 20, 467–472
29 Gardner, T.S. and Faith, J. (2005) Reverse-engineering transcription control networks. Phys. Life Rev. 2, 65–88
30 Bansal, M. et al. (2007) How to infer gene networks from expression profiles. Mol. Syst. Biol. 3, 78
31 Scheinine, A. et al. (2009) Inferring gene networks: dream or nightmare? Ann. N. Y. Acad. Sci. 1158, 287–301
32 Stolovitzky, G. et al. (2007) Dialogue on reverse-engineering assessment and methods: the DREAM of high-throughput pathway inference. Ann. N. Y. Acad. Sci. 1115, 1–22
33 Stolovitzky, G. et al. (2009) The challenges of systems biology. Preface. Ann. N. Y. Acad. Sci. 1158, ix–xii
34 Stolovitzky, G. et al. (2009) Lessons from the DREAM2 Challenges. Ann. N. Y. Acad. Sci. 1158, 159–195
35 Baralla, A. et al. (2009) Inferring gene networks: dream or nightmare? Ann. N. Y. Acad. Sci. 1158, 246–256
36 Leonardson, A.S. et al. (2010) The effect of food intake on gene expression in human peripheral blood. Hum. Mol. Genet. 19, 159–169
37 Barabasi, A.L. and Oltvai, Z.N. (2004) Network biology: understanding the cell’s functional organization. Nat. Rev. Genet. 5, 101–113
38 Watson, M. (2006) CoXpress: differential co-expression in gene expression data. BMC Bioinformatics 7, 509
39 de la Fuente, A. et al. (2004) Discovery of meaningful associations in genomic data using partial correlation coefficients. Bioinformatics 20, 3565–3574
40 Rice, J.J. et al. (2005) Reconstructing biological networks using conditional correlation analysis. Bioinformatics 21, 765–773
41 Schafer, J. and Strimmer, K. (2005) An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics 21, 754–764
42 Zhang, B. et al. (2009) Differential dependency network analysis to identify condition-specific topological changes in biological networks. Bioinformatics 25, 526–532
43 Choi, Y. and Kendziorski, C. (2009) Statistical methods for gene set co-expression analysis. Bioinformatics 25, 2780–2786
44 Cho, S.B. et al. (2009) Identifying set-wise differential co-expression in gene expression microarray data. BMC Bioinformatics 10, 109
45 Prieto, C. et al. (2006) Algorithm to find gene expression profiles of deregulation and identify families of disease-altered genes. Bioinformatics 22, 1103–1110
46 Ho, J.W. et al. (2008) Differential variability analysis of gene expression and its application to human diseases. Bioinformatics 24, i390–398
47 Shipley, B. (2002) Cause and Correlation in Biology: A User’s Guide to Path Analysis, Structural Equations and Causal Inference, Cambridge University Press
48 Keurentjes, J.J. et al. (2006) The genetics of plant metabolism. Nat. Genet. 38, 842–849
49 Fu, J. et al. (2009) System-wide molecular evidence for phenotypic buffering in Arabidopsis. Nat. Genet. 41, 166–167
50 Jansen, R.C. and Nap, J.P. (2001) Genetical genomics: the added value from segregation. Trends Genet. 17, 388–391
51 Jansen, R.C. (2003) Studying complex biological systems using multifactorial perturbation. Nat. Rev. Genet. 4, 145–151
52 Zhu, J. et al. (2004) An integrative genomics approach to the reconstruction of gene networks in segregating populations. Cytogenet. Genome Res. 105, 363–374
53 Bing, N. and Hoeschele, I. (2005) Genetical genomics analysis of a yeast segregant population for transcription network inference. Genetics 170, 533–542
54 Bystrykh, L. et al. (2005) Uncovering regulatory pathways that affect hematopoietic stem cell function using ‘genetical genomics’. Nat. Genet. 37, 225–232
55 Schadt, E.E. et al. (2005) An integrative genomics approach to infer causal associations between gene expression and disease. Nat. Genet. 37, 710–717
56 Lum, P.Y. et al. (2006) Elucidating the murine brain transcriptional network in a segregating mouse population to identify core functional modules for obesity and diabetes. J. Neurochem. 97 (Suppl. 1), 50–62
57 Kulp, D. and Jagalur, M. (2006) Causal inference of regulator-target pairs by gene mapping of expression phenotypes. BMC Genomics 7, 125
58 Liu, B. et al. (2008) Gene network inference via structural equation modeling in genetical genomics experiments. Genetics 178, 1763–1776
59 Aten, J.E. et al. (2008) Using genetic markers to orient the edges in quantitative trait networks: the NEO software. BMC Syst. Biol. 2, 34
60 Chaibub Neto, E. et al. (2008) Inferring causal phenotype networks from segregating populations. Genetics 179, 1089–1100
61 Rockman, M.V. (2008) Reverse engineering the genotype–phenotype map with natural genetic variation. Nature 456, 738–744
62 Liu, B. et al. (2009) Inferring Gene Regulatory Networks from Genetical Genomics Data. In Handbook of Research on Computational Methodologies in Gene Regulatory Networks (Das, S. et al., eds), pp. 79–107, IGI Global
63 Segal, E. et al. (2004) A module map showing conditional activity of expression modules in cancer. Nat. Genet. 36, 1090–1098
64 Segal, E. et al. (2005) From signatures to models: understanding cancer using microarrays. Nat. Genet. 37 (Suppl.), S38–45
65 Carro, M.S. et al. (2010) The transcriptional network for mesenchymal transformation of brain tumours. Nature 463, 318–325
66 Xu, M. et al. (2008) An integrative approach to characterize disease-specific pathways and their coordination: a case study in cancer. BMC Genomics 9 (Suppl. 1), S12
67 Stuart, J.M. et al. (2003) A gene-coexpression network for global discovery of conserved genetic modules. Science 302, 249–255
68 McCarroll, S.A. et al. (2004) Comparing genomic expression patterns across species identifies shared transcriptional profile in aging. Nat. Genet. 36, 197–204
69 Ihmels, J. et al. (2005) Comparative gene expression analysis by differential clustering approach: application to the Candida albicans transcription program. PLoS Genet. 1, e39
70 Barabasi, A.L. and Albert, R. (1999) Emergence of scaling in random networks. Science 286, 509–512
71 Watts, D.J. and Strogatz, S.H. (1998) Collective dynamics of ‘small-world’ networks. Nature 393, 440–442
72 Pieroni, E. et al. (2008) Protein networking: insights into global functional organization of proteomes. Proteomics 8, 799–816
73 Bhalla, U.S. (2003) Understanding complex signaling networks through models and metaphors. Prog. Biophys. Mol. Biol. 81, 45–65
74 Klamt, S. et al. (2009) Hypergraphs and cellular networks. PLoS Comput. Biol. 5, e1000385
75 de la Fuente, A. (2009) What are gene regulatory networks? In Handbook of Research on Computational Methodologies in Gene Regulatory Networks (Das, S. et al., eds), pp. 1–27, IGI Global
76 de la Fuente, A. et al. (2002) Linking the genes: inferring quantitative gene networks from microarray data. Trends Genet. 18, 395–398
77 Spirtes, P. et al. (1993) Causation, Prediction, and Search, MIT Press
78 Opgen-Rhein, R. and Strimmer, K. (2007) From correlation to causation networks: a simple approximate learning algorithm and its application to high-dimensional plant gene expression data. BMC Syst. Biol. 1, 37
79 Ploner, A. et al. (2005) Correlation test to assess low-level processing of high-density oligonucleotide microarray data. BMC Bioinformatics 6, 80
80 Lim, W.K. et al. (2007) Comparative analysis of microarray normalization procedures: effects on reverse engineering gene networks. Bioinformatics 23, i282–288