from ‘differential expression’ to ‘differential networking’ – identification of...



From ‘differential expression’ to ‘differential networking’ – identification of dysfunctional regulatory networks in diseases

Alberto de la Fuente

CRS4 Bioinformatica, Polaris Edificio 3, Località Piscina Manna, 09010 Pula (CA), Italy

Review

Understanding diseases requires identifying the differences between healthy and affected tissues. Gene expression data have revolutionized the study of diseases by making it possible to simultaneously consider thousands of genes. The identification of disease-associated genes requires studying the genes in the context of the regulatory systems they are involved in. A major goal is to identify specific regulatory networks that are dysfunctional in a given disease state. Although we still have not reached a stage where the elucidation of differential regulatory networks is commonly feasible, recent advances have described the first steps towards this goal: the identification of differential coexpression networks. This review describes the shift from differential gene expression to differential networking and outlines how this shift will affect the study of the genetic basis of disease.

Glossary

Differential coexpression: the observation that the correlation (or other measure of association) between the expression levels of two (or more) genes is significantly different (higher or lower) in case (e.g. disease) and control (e.g. healthy) samples.

Differential expression: the observation that the mean expression level of a given gene (or set of genes) is significantly different (higher or lower) between case and control samples.

False discovery rate (FDR): expected ratio of false positive discoveries over all discoveries. For example, if the FDR is estimated to be 0.1 at a given statistical threshold, then 10% of the discoveries can be expected to be erroneous.

Gene coexpression network: a network model in which the nodes are gene activities and the edges represent significant associations between them.

Gene network: an abstract model of gene regulation in which the nodes are gene activities and the edges represent causal influences among the genes (directed) and dependencies due to hidden (unobserved) confounding factors (undirected).

Corresponding author: de la Fuente, A. ([email protected]).

0168-9525/$ – see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.tig.2010.05.001 Trends in Genetics 26 (2010) 326–333

Review Trends in Genetics Vol.26 No.7

Dysfunctional networks in disease

To understand the roles of genes in complex human diseases, genes need to be studied in the context of the regulatory systems they are involved in [1]. Regulatory systems inside cells can be effectively abstracted into networks (Box 1). Such regulatory networks hold the potential to provide the cellular context of all genes of interest and give a means to identify specific subnetworks that are dysfunctional in a given disease state. It is therefore not surprising that several recent publications explicitly consider gene expression data in the context of biomolecular networks. In particular, a wide variety of ways to identify protein interaction subnetworks containing many differentially expressed genes in diseases have been proposed [2]. Other studies went beyond differential mean expression and focused on differential coexpression patterns in diseases [3–12]. The idea behind these approaches is that the identification of changes in gene coexpression patterns between disease and healthy samples provides information about disease-affected regulatory networks.

Here I review recent literature pioneering the study of dysfunctional regulatory networks with a focus on methodologies used to identify differential coexpression patterns in disease gene expression studies. It should be mentioned that differential correlations have also been used with metabolomics data to identify condition-specific alterations in metabolic pathways [13–17]. Indeed, differential coexpression approaches could be equally applied to disease studies involving metabolomics and proteomics data. Moreover, these approaches are not limited to disease studies and could be used for elucidating cell- and tissue-specific regulatory networks and changes associated with, for example, aging [18], or applied to other case–control settings. Without going into technical details, I will outline the main directions in which these efforts have been pursued. The differential networking methodology requires bringing together the forces of two common approaches to the analysis of gene expression data: differential expression studies and network inference. These two approaches to gene expression analysis will be quickly reviewed to provide a necessary background. Recent approaches for differential coexpression analysis will be discussed and additional routes towards identifying differential regulatory networks in disease will be suggested.

Box 1. Networks in a nutshell

Biomolecular regulatory systems consist of thousands of molecular species of different chemical nature. These systems have been described as networks, such as metabolic networks, protein-interaction networks and transcriptional regulatory networks [37]. In these networks the nodes represent biomolecular species (e.g. metabolites, proteins, RNAs) and the edges represent functional, causal or physical interactions between the nodes (Figure I).

The degree of connectivity k of a node is simply the number of edges attached to it (or the sum over the edge weights, which gives the ‘weighted degree’ of a node [23]). The degree distribution of a network provides the probability P(k) of a node to have a degree k. For many biological and technological networks it has been observed that the logarithm of P(k) decreases approximately linearly with the logarithm of k, in other words P(k) approximately follows a power law. Such networks were dubbed scale-free and contain many nodes with very few connections and a small number of hubs with many connections [37,70]. The distance between a pair of nodes refers to the minimum number of edges that needs to be crossed to go from one node to the other. The clustering coefficient of node i is defined as the ratio of the number of edges between nodes connected to i over the total possible number of edges between them. It quantifies how close the neighborhood of node i is to a clique (a subnetwork in which each node is connected to all others). Many biological and technological networks have high average clustering but small average distances between nodes. Such networks are called small-world networks [71].

The abstract representation of biomolecular regulatory systems as networks is fruitful because it provides the ability to study the systems as a whole while ignoring many irrelevant details [37,72]. All chemistry and physics is removed in order to concentrate on the system of interactions. As for all abstractions of natural systems, we are doomed to lose some information when we represent biomolecular regulatory systems as networks [72–74].

Figure I. An example of a network with nodes (black circles) and edges. Edges can be directed (black arrows), indicating an effect running from the source node to the target node, or undirected (red edges), indicating symmetrical relationships. A network can have only undirected edges (undirected networks), only directed edges (directed networks), or both (mixed networks). Edges could be weighted to reflect the strength of the relationship (weighted networks).

Differential expression studies

Gene expression studies of disease are typically performed by comparing gene expression levels between diseased and healthy tissues. This is usually done by testing the statistical significance of the changes in the mean level of expression of each individual gene [19]. To consider genes in their context, methods have been developed to test for simultaneous mean expression changes in a priori defined gene sets or pathways [20–22] and gene coexpression modules [23,24], as well as to identify differentially expressed subnetworks within protein interaction networks [2]. Genes or pathways whose mean expression levels either rise or fall are generally believed to be associated with the disease phenotype. After performing differential expression tests for each gene or pathway, a threshold level must be established, based on the test statistic (or corresponding P value), to determine which genes and pathways are differentially expressed. Selecting a significance level can be difficult because hundreds or thousands of hypotheses are tested simultaneously. A rejection of the null-hypothesis (i.e. accepting that a gene or pathway is significantly differentially expressed) at P < 0.05 can result in many false-positive discoveries. Several solutions to this multiple hypothesis testing problem have been proposed, but control of the false discovery rate [25,26] is the most widely used method in gene expression data analysis.
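To make the false discovery rate control mentioned in the differential expression section concrete, the following is a minimal sketch that assumes the Benjamini-Hochberg step-up procedure, one common FDR-controlling method; the review itself does not say which procedure is meant. The function name `benjamini_hochberg` and its parameters are illustrative choices, not part of any cited method.

```python
import numpy as np


def benjamini_hochberg(p_values, fdr=0.1):
    """Return a boolean mask marking discoveries at the requested FDR level.

    Assumption: Benjamini-Hochberg step-up rule (hypothetical helper,
    not taken from the review).
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)            # indices that sort the P values
    ranked = p[order]
    # Find the largest rank k such that p_(k) <= (k / m) * fdr.
    below = ranked <= (np.arange(1, m + 1) / m) * fdr
    discoveries = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        # All hypotheses ranked at or before k are declared discoveries.
        discoveries[order[: k + 1]] = True
    return discoveries


# One P value per gene (made-up numbers for illustration only).
mask = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6], fdr=0.05)
```

In this made-up example only the first two genes pass, although four of the five P values are below 0.05, which illustrates why a fixed P < 0.05 cut-off is too permissive when many hypotheses are tested.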

Although differential expression approaches have been very successful, much of the information contained in gene expression datasets is ignored. Known disease genes are often not differentially expressed in diseases because mutations in the coding region can affect the function of the gene without affecting its expression level. Furthermore, a variety of post-translational modifications (e.g. reversible phosphorylation or acylation) can affect regulatory activities of a gene product independently of its expression level. These facts have hampered the identification of disease-related genes from gene expression studies.

Elucidating gene networks

On the other side of the spectrum of gene expression data analysis are the approaches for network inference. Many approaches for inferring gene networks from gene expression data (Box 2) have been proposed and applied to genome-wide expression datasets [27–31] (Box 3). Typically, these methods require much more data than the differential expression tests mentioned above, and the data need to be produced under a controlled experimental setup (e.g. a large number of targeted quantitative genetic perturbations have to be created and the genome-wide gene expression responses measured). The need for such large datasets was recently emphasized by the DREAM (Dialogue for Reverse-Engineering Assessment and Methods) initiative, in a community effort to infer regulatory networks [31–35]. Elucidation of reliable genome-scale gene networks seems outside the scope of current experimental abilities, but perhaps this should not be the goal in the first place. Indeed, it could be argued that obtaining genome-scale networks does not provide much insight into the functioning of specific systems underlying diseases. Instead, a targeted approach to identify only subnetworks that differ between a selected set of phenotypes is a more relevant goal. As a first step towards that goal it is important to identify how relationships among gene activities change between healthy and disease expression samples.

From differential expression to differential coexpression

Recent investigations have gone beyond testing for differential expression and aim to elucidate dysfunctional regulatory networks in disease. Instead of focusing on differences in mean gene expression levels, the goal is to identify differences in their coexpression patterns (commonly quantified by pairwise correlations) in healthy and disease-affected samples (Box 4). Pairwise relationships between gene expression levels result from regulatory relationships among the genes, and identifying which of these are altered in disease-affected tissue as compared to healthy tissue is a first step in pinpointing dysfunctional regulatory systems.

The first approaches to test for differential coexpression [3–5] applied to cancer gene expression datasets identified several transcriptional regulators known to be involved in cancer that were highly differentially coexpressed whereas their mean expression levels had hardly changed. This illustrated the relevance of considering coexpression changes in addition to differential mean expression when comparing gene expression datasets. Strong support for this need was recently provided [10]. As a proof-of-principle, a differential coexpression approach was used to compare gene expression data from two varieties of bulls, one with and one without a known mutation in the transcriptional regulator myostatin. Whereas the mean expression of the myostatin gene did not significantly differ between the two varieties, the gene was ranked highest among 920 transcriptional regulators when considering a measure based on differential coexpression [10]. Several other investigations yielded the same conclusion [3,4,6–9,11,12]. It is therefore important to perform differential coexpression tests in addition to the common differential mean expression testing.

Box 2. What are gene networks?

Gene networks (also called gene regulatory networks [75]) are abstract models with nodes representing gene activities (gene expression levels, mRNA concentrations) and edges representing direct causal influences and correlations between the gene activities. The direct causal influence A → B means that the activity of gene B changes as a consequence of a change in gene activity A and no other gene activity or set of gene activities mediates the influence (e.g. in the cascade A → B → C, there is a causal effect of A on C, but because this is mediated by B there will be no edge drawn from A to C). A direct causal influence could be due to gene A’s protein product activating the transcription of gene B upon binding to its promoter sequence (as in a transcription factor–target relationship, such as the gene 1 → gene 2 relation; Figure I), but also to more complicated processes, such as gene A encoding a metabolic enzyme producing a metabolite that in turn regulates the transcription of gene B. These detailed biochemical events are hidden from the observed set of variables (gene expression levels) and their effects will merely result in an observable direct causal effect (such as the gene 2 → gene 4 relation; Figure I).

Gene networks are context specific: the regulatory structure among genes depends on the developmental stage, cell type, environment, genotype and disease state. For a comprehensive discussion on the nature of gene networks please refer to Ref. [75].

Figure I. A gene network as the abstract representation of the cellular biochemistry network. Nodes are metabolites, proteins and gene activities. Solid arrows depict biochemical processes such as transcriptional regulation, metabolic conversion and protein association. All detailed biochemical processes are projected onto causal effects and associations in the gene activity space, giving rise to the gene network concept. The resulting dashed arrows represent direct causal regulatory influences between gene activities. Figure reprinted from Ref. [28]; Trends in Biotechnology, vol. 20, Brazhnik, P. et al., Gene networks, how to put the function in genomics, pp. 467–472, copyright 2002, with permission from Elsevier.

Box 3. Inferring gene networks

Gene network inference is the task of identifying the network(s) of causal regulatory influences between genes that optimally describes observed gene expression patterns. Inferring causation requires targeted perturbations and response measurements. This requires either experimental perturbations (e.g. single gene knockout, knockdown or over-expression), or natural genetic perturbations (i.e. simultaneous genotyping and gene expression measurements). In the latter case, the naturally occurring DNA polymorphisms could conceptually be seen as systematic perturbations to the regulatory networks [51]. The basic logic behind gene network inference is quite simple: when the expression level of gene A is perturbed (experimentally or by a naturally occurring polymorphism) and subsequently gene B’s expression level is observed to change, then gene B is causally downstream of gene A. Then, it has to be decided if the causal effect is direct (not mediated by any set of the other observed gene activities) or indirect (mediated by some set of the other observed gene activities). Although simple in concept, the technical aspects can be quite complicated. There is a high demand on data because many perturbations are needed to elucidate the wiring of genome scale gene networks [76]. Such measurements are typically not available; instead most disease gene expression studies concern observational data, in other words data collected over a population of similar individuals without any specified perturbations (no experimental interventions or genotype data collected). These data do not generally allow for causal inference and it is only possible to identify correlations between gene expression levels.

In correlation networks (also called gene coexpression networks in this context [75]) pairs of genes are connected by an undirected edge if their activities (expression levels) behave similarly over a series of gene expression measurements, usually quantified by pairwise correlation [27,75]. Gene activities can be correlated due to different causal relationships including: (i) direct effects, (ii) indirect effects (transitivity), and (iii) confounding effects (common regulator). Several algorithms have been proposed to eliminate edges corresponding to the situations (ii) and (iii) (if the confounding variables are measured) [39–41], resulting in a network with edges corresponding to direct effects or confounding due to unobserved variables. Under some assumptions for the network structure it is theoretically possible to decide the orientation of the edges [77,78], but unfortunately these assumptions (such as absence of directed cycles in the network and absence of confounding factors) are very unlikely to be met in the present context.

Box 4. Differential coexpression

The number of differential coexpression definitions and proposed statistical tests is plentiful. The common principle uniting these tests is the common focus on changes in coexpression patterns between gene expression levels (Figure I).

The association between two gene expression levels can be quantified by the Pearson correlation: $r_{ij} = \mathrm{cov}(x_i, x_j) / \sqrt{\mathrm{var}(x_i)\,\mathrm{var}(x_j)}$. Here, cov() is the covariance and var() the variance of the gene expression levels. The correlation between a pair of gene expression levels is then calculated over the healthy sample, $r^{H}_{ij}$, and over the disease sample, $r^{D}_{ij}$. At this stage one could test two hypotheses $H_{01}: r^{H}_{ij} = 0$ and $H_{02}: r^{D}_{ij} = 0$. If neither is rejected, then we have the uninteresting scenario where the genes are not correlated in either sample. If both are rejected but the correlations have the same sign, then we have the uninteresting scenario that the genes are similarly correlated in both samples, and could thus not have any significant involvement in the disease. If only one of the hypotheses is rejected, or when both are rejected but correlations have changed signs, then the pair of genes is accepted to be differentially coexpressed. The approaches based on coexpression networks discussed in the main text essentially take this approach to differential coexpression. Testing for non-zero correlations usually is done by the t-test: $t_{ij} = r_{ij} \sqrt{(n - 2) / (1 - r_{ij}^{2})}$, where n is the number of observations in the sample.

Alternatively, one could test directly for differential coexpression by testing the null-hypothesis $H_{0}: r^{H}_{ij} = r^{D}_{ij}$. Several approaches following this line of thought are discussed in the main text. Testing could involve the Z-test $Z_{ij} = |z^{1}_{ij} - z^{2}_{ij}| / \sqrt{1/(n_1 - 3) + 1/(n_2 - 3)}$, where $z_{ij}$ are the Fisher-transforms of the correlations, $z_{ij} = \tfrac{1}{2} \ln \left| (1 + r_{ij}) / (1 - r_{ij}) \right|$, and $n_1$ and $n_2$ are sample sizes of the healthy and disease sample, respectively.

Note that the data requirements for differential coexpression analysis are different from those for differential mean expression analysis. Whereas for the latter often as few as three observations per group are taken, reliably calculating correlations will require at least tens of replicates per group, making the differential coexpression approach only feasible for larger disease studies. Also, care should be taken when selecting techniques to normalize raw expression data, as some methods will introduce bias in the correlation structure of the data [79,80].

Figure I. Example of changed gene expression correlations with unchanged mean expression levels. Each dot corresponds to a healthy (left) or sick (right) subject. Whereas the correlation in the group of healthy patients is high – genes A and B are tightly coordinated in their expression – this is not the case in the sick group of subjects.
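The statistics described in Box 4 can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the two-sided P value for the Z statistic assumes a standard normal reference distribution, which is the usual choice for Fisher-transformed correlations but is not spelled out in the box.

```python
import math

import numpy as np


def pearson_r(x, y):
    """Pearson correlation r = cov(x, y) / sqrt(var(x) * var(y))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / math.sqrt((xc @ xc) * (yc @ yc)))


def t_statistic(r, n):
    """t statistic for H0: r = 0, with n observations in the sample."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def fisher_z(r):
    """Fisher transform; for |r| < 1 the ratio is positive, so no abs() is needed."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def differential_correlation_test(r_healthy, n_healthy, r_disease, n_disease):
    """Z statistic for H0: r_healthy = r_disease, plus a two-sided normal P value."""
    standard_error = math.sqrt(1.0 / (n_healthy - 3) + 1.0 / (n_disease - 3))
    z_stat = abs(fisher_z(r_healthy) - fisher_z(r_disease)) / standard_error
    p_value = math.erfc(z_stat / math.sqrt(2.0))  # assumption: N(0, 1) reference
    return z_stat, p_value


# Made-up example: a gene pair correlated in healthy but not in disease samples.
z_stat, p_value = differential_correlation_test(0.8, 30, 0.1, 30)
```

With 30 samples per group, as in this invented example, the drop from r = 0.8 to r = 0.1 gives a Z statistic above 3. With only three samples per group the denominator is undefined, which echoes the box's point that correlations need tens of replicates.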


Comparing coexpression networks

Studies of differential coexpression typically have involved, either explicitly or implicitly, the construction of coexpression networks for healthy and disease samples. Comparing the structure of the two coexpression networks provides insight into disease-specific alterations in the regulatory systems underlying the correlation patterns. The simplest way to perform such a comparison is to look at the degree (or connectivity) of each gene in the two networks. Genes that have a strongly altered connectivity are thought to play an important role in the disease phenotype [7,36]. The main difficulty of this approach is establishing a threshold for each edge to be included in the network. Ideally, one sets the threshold level such that the resulting networks include as many biologically relevant edges as possible while keeping the number of spurious edges low. The multiple hypothesis testing problem is even more severe than for differential expression testing because for n genes we now have [n × (n − 1)]/2 edges to test! Several authors have resorted to arbitrary, stringent thresholds [4,7], or selected a fixed number of strongest edges [6]. Selecting a very high correlation threshold indeed guarantees the exclusion of many spurious edges, but obviously will also exclude many relevant ones.
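The degree comparison described above can be sketched as follows. This is a minimal illustration, assuming expression matrices with samples in rows and genes in columns and a hard threshold on the absolute Pearson correlation; the function names and the default threshold of 0.8 are arbitrary choices for the example, not values taken from the cited studies.

```python
import numpy as np


def coexpression_adjacency(expr, threshold):
    """Boolean adjacency matrix: edge where |r| >= threshold, no self-loops.

    expr is assumed to be a (samples x genes) array.
    """
    corr = np.corrcoef(expr, rowvar=False)
    adj = np.abs(corr) >= threshold
    np.fill_diagonal(adj, False)
    return adj


def differential_connectivity(expr_healthy, expr_disease, threshold=0.8):
    """Per-gene change in degree (disease minus healthy)."""
    k_healthy = coexpression_adjacency(expr_healthy, threshold).sum(axis=0)
    k_disease = coexpression_adjacency(expr_disease, threshold).sum(axis=0)
    return k_disease - k_healthy


# Toy data: genes 0 and 1 are coupled in the healthy group only.
g0 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
alternating = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
healthy = np.column_stack([g0, 2.0 * g0, alternating])
disease = np.column_stack([g0, np.array([1.0, 1.0, -1.0, -1.0, 1.0, 1.0]), alternating])
degree_change = differential_connectivity(healthy, disease)
```

Genes with a large absolute `degree_change` are the candidates this kind of analysis would flag. The result depends strongly on the chosen threshold, which is the difficulty discussed above.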

A potentially effective way to select the threshold is to use the global network topology of the inferred coexpression networks to guide the choice [8,23]. Several global network topological characteristics have been observed in protein interaction networks and metabolic networks, such as the power-law degree distribution and high clustering coefficient [37]. It would be plausible to assume that gene networks have similar properties and as such are reflected in coexpression networks as well. However, given the nature of coexpression networks, care should be taken when making such conclusions. Indeed, studies show [4,6,7] that coexpression networks possess power-law degree distributions. Based on these observations, Zhang and Horvath formally proposed a scale-free topology criterion for network construction in which a threshold value is chosen such that the resulting networks are approximately scale free [23]. A recent study showed how one can use the clustering coefficient to guide the selection of this threshold [8]. The motivation behind this approach was that clustering coefficients in the coexpression network should be higher than expected by chance. Starting with a fully connected correlation network, the weakest edges are dropped until a maximum difference between clustering in the network and randomized networks is found. Using simulation studies, thresholds obtained through this approach were shown to consistently outperform statistically motivated thresholds. This is a promising result showing that tools from complex network analysis could be a powerful alternative to statistical approaches to deal with the difficulties associated with multiple hypothesis testing.
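The clustering-based threshold choice can be illustrated with a rough sketch. It makes a strong simplifying assumption that should not be attributed to Ref. [8]: instead of generating randomized networks, the expected clustering of a random network is approximated by the edge density of the thresholded network (the expectation for a random graph with the same number of nodes and edges). The function names and the candidate grid are hypothetical.

```python
import numpy as np


def average_clustering(adj):
    """Mean local clustering coefficient of an undirected 0/1 adjacency matrix.

    Nodes with fewer than two neighbours contribute a coefficient of 0.
    """
    a = np.asarray(adj, dtype=float)
    degree = a.sum(axis=0)
    triangles = np.diagonal(a @ a @ a) / 2.0   # triangles through each node
    possible = degree * (degree - 1) / 2.0     # pairs of neighbours per node
    local = np.zeros_like(degree)
    mask = possible > 0
    local[mask] = triangles[mask] / possible[mask]
    return float(local.mean())


def pick_threshold(corr, candidates):
    """Return (threshold, gap) maximizing clustering minus its random expectation."""
    abs_corr = np.abs(np.asarray(corr, dtype=float))
    n = abs_corr.shape[0]
    best_threshold, best_gap = None, -np.inf
    for t in candidates:
        adj = abs_corr >= t
        np.fill_diagonal(adj, False)
        # Assumption: random-network clustering is approximated by edge density.
        density = adj.sum() / (n * (n - 1))
        gap = average_clustering(adj) - density
        if gap > best_gap:
            best_threshold, best_gap = t, gap
    return best_threshold, best_gap


# Toy correlation matrix: two tight modules of three genes, weak links between them.
corr = np.full((6, 6), 0.3)
corr[:3, :3] = 0.9
corr[3:, 3:] = 0.9
np.fill_diagonal(corr, 1.0)
threshold, gap = pick_threshold(corr, candidates=[0.2, 0.5])
```

In this toy case the lower threshold keeps every edge, so clustering is no higher than expected by chance, whereas the higher threshold isolates the two modules and maximizes the gap.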

Comparing weighted coexpression networks

Arguably a better approach to compare coexpression networks would be to drop the idea of the threshold altogether and consider weighted coexpression networks in which all genes are connected to all other genes, and each edge is weighted to reflect the strength (or the confidence in the existence) of the relationship [23]. Essentially, the analysis of weighted networks is equivalent to the analysis of the weight matrices, such as the correlation matrix. The CoXpress method proposed by Watson [38] is aimed at identifying differentially coexpressed gene groups. The procedure first performs a hierarchical clustering using the correlation matrix obtained from the healthy data (or disease data), then tests if the average correlation among genes in a cluster is higher than expected by chance. Each cluster with a significant average correlation in the healthy data but not in the disease data, or vice versa, is considered to be differentially coexpressed. This approach has been extended recently to allow for using a priori defined gene sets and was applied to a study of mammary gland tumors in mice [12]. Instead of using classical views of pathways (as appearing in textbooks and pathway databases), the authors provided a top-down definition of pathways based on a modularization of the mouse protein interaction network. These network-based pathways were subjected to both differential expression (using Gene Set Enrichment Analysis: GSEA [20]) and differential coexpression over a progression of mouse mammary gland tumors covering three stages: wild type (healthy), hyperplastic (early disease) and tumor (advanced disease). This study highlighted the dynamic interplay between the differential expression and differential coexpression of the pathways. Some pathways were turned off by downregulation (lower mean expression) and decreased coexpression. Others were induced via upregulation and increased coexpression. Furthermore, some pathways showed an increase in mean expression, but a decrease in coexpression, or vice versa. This counter-intuitive result led to an important insight: although commonly interpreted as an indication that a pathway is involved in the disease examined, an increase in the mean intensity of gene expression levels in a pathway accompanied by a decrease in correlations might merely indicate a change in functional assignment of constituent genes because the genes are potentially part of many different pathways. Conversely, a downregulated pathway with increased correlation might indicate that the mean intensity and inter-modular activity is replaced by a higher dedication of the genes to the pathway. Only looking at mean expression changes could lead to incorrect conclusions about the involvement of a pathway in a disease condition [12].
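The two-decision logic of this family of methods can be illustrated for an a priori defined gene set. The sketch below is an assumption-laden illustration rather than the CoXpress implementation: it skips the hierarchical clustering step, uses random gene sets of the same size as the chance model, and all function names and the cutoff `alpha` are hypothetical.

```python
import numpy as np

def mean_pairwise_correlation(expr, genes):
    """Average correlation over all gene pairs in `genes` (expr: samples x genes)."""
    corr = np.corrcoef(expr[:, genes], rowvar=False)
    upper = np.triu_indices(len(genes), k=1)
    return corr[upper].mean()

def set_coexpression_pvalue(expr, genes, n_random=1000, seed=0):
    """Empirical p-value of the set's mean correlation.

    ASSUMPTION: 'expected by chance' is modelled by random gene sets
    of the same size drawn from the same expression matrix.
    """
    rng = np.random.default_rng(seed)
    observed = mean_pairwise_correlation(expr, genes)
    n_genes = expr.shape[1]
    null = np.array([
        mean_pairwise_correlation(
            expr, rng.choice(n_genes, size=len(genes), replace=False))
        for _ in range(n_random)
    ])
    return (1 + np.sum(null >= observed)) / (1 + n_random)

def differentially_coexpressed(healthy, disease, genes, alpha=0.05):
    """Flag a set that is significantly coexpressed in exactly one condition."""
    p_healthy = set_coexpression_pvalue(healthy, genes)   # decision 1
    p_disease = set_coexpression_pvalue(disease, genes)   # decision 2
    flag = (p_healthy < alpha) != (p_disease < alpha)
    return flag, p_healthy, p_disease
```

Because the flag depends on two separate significance calls, a set whose p-values fall just on either side of `alpha` is reported as changed even if the underlying correlations barely differ.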

The determination of weighted pairwise relationships can also be done using soft thresholding [23]. Instead of using the raw correlations to obtain the weights, the correlation coefficients are raised to a certain power whose value is selected in order to obtain a network with scale-free weighted degree distributions [23]. For powers higher than 1 this results in downsizing the weaker correlations more drastically than the higher correlations. This weighted coexpression network approach was applied to a study of obesity in mice [9]. Using data from two F2 mouse intercrosses, extreme phenotypes (30 leanest versus 30 heaviest mice) were contrasted to identify differential coexpression involved in the obesity phenotype. For each phenotype a weighted coexpression network was created and genes were compared on the basis of their weighted degree. A set of genes was identified that were increased both in mean expression levels and (weighted) connectivity in the obese mice compared to the lean mice. This set was enriched in EGF and EGF-like factors that have been reported to play a role in the induction of obesity [9].
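A minimal sketch of soft thresholding and the weighted-degree comparison is given below. The power `beta=6` is a placeholder assumption, not a value selected by the scale-free criterion described above, and the function names are hypothetical. With this power a correlation of 0.9 keeps a weight of about 0.53, whereas a correlation of 0.3 shrinks to about 0.0007.

```python
import numpy as np

def soft_threshold_adjacency(expr, beta=6):
    """Weighted network: |correlation| raised to the power beta.

    expr: samples x genes. ASSUMPTION: beta is supplied by the user;
    choosing it to approximate a scale-free degree distribution is not shown.
    """
    corr = np.corrcoef(expr, rowvar=False)
    adj = np.abs(corr) ** beta
    np.fill_diagonal(adj, 0.0)   # ignore self-connections
    return adj

def weighted_degree(adj):
    """Sum of edge weights per gene (weighted connectivity)."""
    return adj.sum(axis=1)

def connectivity_change(lean, obese, beta=6):
    """Per-gene difference in weighted degree between two phenotype groups."""
    k_lean = weighted_degree(soft_threshold_adjacency(lean, beta))
    k_obese = weighted_degree(soft_threshold_adjacency(obese, beta))
    return k_obese - k_lean
```

Genes with a large positive change gain connectivity in the second group; combining this with a per-gene change in mean expression gives the kind of joint comparison described for the mouse data.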

Healthy and disease networks could be refined by considering higher order (partial) correlations [39–41] or, equivalently, local dependency networks. This approach was recently pursued to identify differential dependency networks in subclasses of breast cancer [42].
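To illustrate why partial correlations refine a coexpression network, the sketch below computes full-order partial correlations from the inverse correlation matrix. This is a textbook construction used here as an assumption; the cited methods rely on low-order partial correlations or regularized estimators, which are needed when genes outnumber samples. The function name is hypothetical.

```python
import numpy as np

def partial_correlations(expr):
    """Full-order partial correlations from the inverse correlation matrix.

    expr: samples x genes, with more samples than genes so that the
    correlation matrix is invertible (ASSUMPTION of this sketch).
    """
    corr = np.corrcoef(expr, rowvar=False)
    precision = np.linalg.inv(corr)
    scale = np.sqrt(np.diag(precision))
    # Partial correlation of i and j given all other genes.
    pcorr = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr
```

In a chain where gene x influences y and y influences z, x and z are correlated, but their partial correlation given y is close to zero, so the indirect edge drops out of the network.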

Direct differential coexpression measures

The drawback of compiling two separate coexpression networks is that it requires separate decisions and thresholds for the healthy and disease networks. For example, in the CoXpress method, a cluster that has significant average correlation in the healthy data (decision 1) but not in the disease data (decision 2) is considered to be differentially coexpressed. Identifying differential coexpression could be made simpler: instead of establishing that the coexpression is significant in one condition and not in the other, one could test directly if the change in coexpression is significant. The early test for differential coexpression by Lai et al. [5] belongs to this category of methods, as does the measure of Hudson et al. [10]. In another early study, Kostka and Spang formulated an approach for selecting sets of genes based on a differential coexpression measure [3]. Recently, Gene Set Coexpression Analysis [43] (GSCA, in analogy to the widely used GSEA for testing differential mean expression of pathways) was proposed to test for pathway differential coexpression by using a measure summarizing the change in coexpression over all pairs of genes inside a given pathway. The benefit of GSCA compared to CoXpress is that the pathways do not have to be enriched in correlations either in the healthy or the disease data for the overall change in correlation to be significant [43]. This enables cases to be captured in which some correlations in a given pathway go up while others go down, reflecting a specific rewiring of the regulatory system. Others have looked at changing correlations between different pathways [44]. Instead of focusing on the coordinated expression of genes within a given pathway, the focus was put on the coordination between genes from distinct pathways.
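A direct test of this kind can be sketched as a single statistic summarizing the change in all pairwise correlations within a gene set, assessed by permuting sample labels. The code is written in the spirit of GSCA and is not its published implementation: the root-mean-square summary, the permutation scheme and the function names are assumptions made for illustration.

```python
import numpy as np

def pairwise_correlations(expr):
    """Vector of correlations for all gene pairs (expr: samples x genes)."""
    corr = np.corrcoef(expr, rowvar=False)
    return corr[np.triu_indices(corr.shape[0], k=1)]

def dispersion(expr_a, expr_b):
    """Root-mean-square change in pairwise correlation between two conditions."""
    diff = pairwise_correlations(expr_a) - pairwise_correlations(expr_b)
    return np.sqrt(np.mean(diff ** 2))

def differential_coexpression_test(healthy, disease, n_perm=1000, seed=0):
    """Permutation p-value for the change in coexpression of one gene set.

    ASSUMPTION: columns are the genes of a single predefined set, and
    sample labels are exchangeable under the null of no change.
    """
    rng = np.random.default_rng(seed)
    observed = dispersion(healthy, disease)
    pooled = np.vstack([healthy, disease])
    n_healthy = healthy.shape[0]
    exceed = 0
    for _ in range(n_perm):
        order = rng.permutation(pooled.shape[0])
        perm_a = pooled[order[:n_healthy]]
        perm_b = pooled[order[n_healthy:]]
        if dispersion(perm_a, perm_b) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)
```

Because the differences are squared before averaging, correlations that rise and correlations that fall both add to the statistic, so a rewired set can be detected even when its average correlation is unchanged.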

In addition to investigating changes in mean gene expression levels, and correlations between gene expression levels, one could examine the variance of gene expression distributions in healthy versus disease samples [45,46]. Low gene expression variability might indicate strong homeostatic control. If variances in the disease sample are drastically higher than in the healthy sample, such control might have been lost.
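A per-gene comparison of variances could look like the sketch below. It uses a log variance ratio with a permutation test, which is an illustrative choice and not necessarily the statistic used in [45,46]; the function names are hypothetical.

```python
import numpy as np

def log_variance_ratio(healthy, disease):
    """Per-gene log(variance in disease / variance in healthy)."""
    return np.log(disease.var(axis=0, ddof=1) / healthy.var(axis=0, ddof=1))

def differential_variability(healthy, disease, n_perm=1000, seed=0):
    """Two-sided permutation p-values for a change in variance per gene.

    ASSUMPTION: each group is mean-centered first, so that a shift in
    mean expression does not masquerade as a change in variance.
    """
    rng = np.random.default_rng(seed)
    h = healthy - healthy.mean(axis=0)
    d = disease - disease.mean(axis=0)
    observed = log_variance_ratio(h, d)
    pooled = np.vstack([h, d])
    n_h = h.shape[0]
    exceed = np.zeros(pooled.shape[1])
    for _ in range(n_perm):
        order = rng.permutation(pooled.shape[0])
        perm = log_variance_ratio(pooled[order[:n_h]], pooled[order[n_h:]])
        exceed += np.abs(perm) >= np.abs(observed)
    return observed, (exceed + 1) / (n_perm + 1)
```

A large positive log ratio with a small p-value marks a gene whose expression is much more variable in the disease samples, consistent with a loss of homeostatic control. When many genes are tested, the p-values still need a multiple-testing correction.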

The methods described above allow investigators to gain a deeper understanding of disease-associated changes in gene expression patterns beyond differential expression. By identifying differential coexpression, insights were gained which simply were missed when performing comparisons on mean gene expression levels alone. As Mentzen et al. pointed out, only looking at mean expression changes might even lead to incorrect conclusions about the involvement of particular pathways in disease conditions [12]. The shift from differential expression to differential coexpression has already delivered its first promises and will continue to be beneficial for disease studies in the future.

From differential coexpression to differential networking

Identifying differential coexpression is the first step towards identifying differential gene networks. As Bill Shipley insightfully states in his book Cause and Correlation in Biology [47]: ‘As with shadows, these correlational patterns are incomplete – and potentially ambiguous – projections of the original causal processes. As with shadows, we can infer much about the underlying causal process if we can learn to study their details, to sharpen their contours, and especially if we can study them in context.’

When sets of changed correlations have been identified, the next step is to establish the causal influences in the regulatory systems (i.e. to put directions on the edges in the undirected coexpression network) and, more importantly, to identify which causal influences have disappeared in the disease network with respect to the healthy network. Such disappeared regulatory mechanisms resulted in the observed changes in correlations and potentially could underlie the associated disease phenotype. Although it is not trivial to identify the causal system from the correlation patterns, the changes in correlation hint at the interesting regions of the network involved in disease which could form the basis for further detailed analysis. Systematic perturbations (e.g. experimental gene knockouts) are needed to establish the edges’ direction. Particular promise comes from so-called systems genetics experiments in which genotyping and gene expression data (and possibly metabolomics and proteomics data [48,49]) are simultaneously collected from a population under study. It has been demonstrated that causal links in gene networks can be elucidated based on these data [50–60] (reviewed in Refs [61,62]). Systems genetics datasets have been obtained from a wide variety of organisms and many datasets for human disease studies will be produced in the near future. Genotyping data are collected at a tremendous rate and it is becoming clearer that profound insights into human disease cannot be obtained by looking at genotypes alone. The complex interplay between thousands of molecular species involved in disease phenotypes must be elucidated to obtain a deeper insight into disease physiology [1]. Systems genetic data could be used to perform a differential coexpression analysis by contrasting two extremes of the disease phenotype (e.g. the healthiest individuals versus the most affected individuals), an approach similar to one previously used in mice [9], followed by an analysis of the whole dataset to establish the causal structures underlying the correlational differences. In addition, dysfunctional transcription factors [63–65] and microRNAs involved in disease phenotypes [12] could be identified by looking at the changing coexpression structure among their experimentally established, and computationally predicted, targets. Finally, increased confidence in changing coexpression patterns could be gained by identifying pattern changes that are shared between different human diseases [6,66] and between humans and animal disease models. Instead of looking at conserved coexpression patterns across species [67–69], the focus would then shift to conserved changes in coexpression patterns in similar diseases across species. The differential networking methodology discussed here will certainly play a strong role in future analysis of the massive amounts of disease genotyping and gene expression data that will soon be generated, and is likely to bring profound insights into the dysfunctional regulatory systems underlying complex human diseases.

Acknowledgements

I kindly thank Paolo Uva, Diogo Camacho, three anonymous reviewers and the editor for critical reading of the manuscript and their insightful suggestions. This work was supported in part by the Regional Authorities of Sardinia (see: http://www.sardegnaricerche.it/).

References

1 Schadt, E.E. (2009) Molecular networks as sensors and drivers of common human diseases. Nature 461, 218–223

2 Ideker, T. and Sharan, R. (2008) Protein networks in disease. Genome Res. 18, 644–652

3 Kostka, D. and Spang, R. (2004) Finding disease specific alterations in the co-expression of genes. Bioinformatics 20 (Suppl. 1), i194–199

4 Carter, S.L. et al. (2004) Gene co-expression network topology provides a framework for molecular characterization of cellular state. Bioinformatics 20, 2242–2250

5 Lai, Y. et al. (2004) A statistical method for identifying differential gene–gene co-expression patterns. Bioinformatics 20, 3146–3155

6 Choi, J.K. et al. (2005) Differential coexpression analysis using microarray data and its application to human cancer. Bioinformatics 21, 4348–4355

7 Reverter, A. et al. (2006) Simultaneous identification of differential gene expression and connectivity in inflammation, adipogenesis and cancer. Bioinformatics 22, 2396–2404

8 Elo, L.L. et al. (2007) Systematic construction of gene coexpression networks with applications to human T helper cell differentiation process. Bioinformatics 23, 2096–2103

9 Fuller, T.F. et al. (2007) Weighted gene coexpression network analysis strategies applied to mouse weight. Mamm. Genome 18, 463–472

10 Hudson, N.J. et al. (2009) A differential wiring analysis of expression data correctly identifies the gene containing the causal mutation. PLoS Comput. Biol. 5, e1000382

11 Hu, R. et al. (2009) Detecting intergene correlation changes in microarray analysis: a new approach to gene selection. BMC Bioinformatics 10, 20


12 Mentzen, W.I. et al. (2009) Dissecting the dynamics of dysregulation of cellular processes in mouse mammary gland tumor. BMC Genomics 10, 601

13 Steuer, R. et al. (2003) Observing and interpreting correlations in metabolomic networks. Bioinformatics 19, 1019–1026

14 Martins, A.M. et al. (2004) A systems biology study of two distinct growth phases of Saccharomyces cerevisiae. Curr. Genomics 5, 649–663

15 Weckwerth, W. et al. (2004) Differential metabolic networks unravel the effects of silent plant phenotypes. Proc. Natl. Acad. Sci. U. S. A. 101, 7809–7814

16 Camacho, D. et al. (2005) The origin of correlations in metabolomics data. Metabolomics 1, 53–63

17 Steuer, R. (2006) On the analysis and interpretation of correlations in metabolomic data. Brief. Bioinformatics 7, 151–158

18 Gillis, J. and Pavlidis, P. (2009) A methodology for the analysis of differential coexpression across the human lifespan. BMC Bioinformatics 10, 306

19 Cui, X. and Churchill, G.A. (2003) Statistical tests for differential expression in cDNA microarray experiments. Genome Biol. 4, 210

20 Subramanian, A. et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U. S. A. 102, 15545–15550

21 Dinu, I. et al. (2007) Improving gene set analysis of microarray data by SAM-GS. BMC Bioinformatics 8, 242

22 Ackermann, M. and Strimmer, K. (2009) A general modular framework for gene set enrichment analysis. BMC Bioinformatics 10, 47

23 Zhang, B. and Horvath, S. (2005) A general framework for weighted gene co-expression network analysis. Stat. Appl. Genet. Mol. Biol. 4, Article 17

24 Ghazalpour, A. et al. (2006) Integrating genetic and network analysis to characterize genes related to mouse weight. PLoS Genet. 2, e130

25 Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Stat. Soc. B 57, 289–300

26 Storey, J.D. and Tibshirani, R. (2003) Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. U. S. A. 100, 9440–9445

27 D’Haeseleer, P. et al. (2000) Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics 16, 707–726

28 Brazhnik, P. et al. (2002) Gene networks: how to put the function in genomics. Trends Biotechnol. 20, 467–472

29 Gardner, T.S. and Faith, J. (2005) Reverse-engineering transcription control networks. Phys. Life Rev. 2, 65–88

30 Bansal, M. et al. (2007) How to infer gene networks from expression profiles. Mol. Syst. Biol. 3, 78

31 Scheinine, A. et al. (2009) Inferring gene networks: dream or nightmare? Ann. N. Y. Acad. Sci. 1158, 287–301

32 Stolovitzky, G. et al. (2007) Dialogue on reverse-engineering assessment and methods: the DREAM of high-throughput pathway inference. Ann. N. Y. Acad. Sci. 1115, 1–22

33 Stolovitzky, G. et al. (2009) The challenges of systems biology. Preface. Ann. N. Y. Acad. Sci. 1158, ix–xii

34 Stolovitzky, G. et al. (2009) Lessons from the DREAM2 Challenges. Ann. N. Y. Acad. Sci. 1158, 159–195

35 Baralla, A. et al. (2009) Inferring gene networks: dream or nightmare? Ann. N. Y. Acad. Sci. 1158, 246–256

36 Leonardson, A.S. et al. (2010) The effect of food intake on gene expression in human peripheral blood. Hum. Mol. Genet. 19, 159–169

37 Barabasi, A.L. and Oltvai, Z.N. (2004) Network biology: understanding the cell’s functional organization. Nat. Rev. Genet. 5, 101–113

38 Watson, M. (2006) CoXpress: differential co-expression in gene expression data. BMC Bioinformatics 7, 509

39 de la Fuente, A. et al. (2004) Discovery of meaningful associations in genomic data using partial correlation coefficients. Bioinformatics 20, 3565–3574

40 Rice, J.J. et al. (2005) Reconstructing biological networks using conditional correlation analysis. Bioinformatics 21, 765–773

41 Schafer, J. and Strimmer, K. (2005) An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics 21, 754–764

42 Zhang, B. et al. (2009) Differential dependency network analysis to identify condition-specific topological changes in biological networks. Bioinformatics 25, 526–532


43 Choi, Y. and Kendziorski, C. (2009) Statistical methods for gene set co-expression analysis. Bioinformatics 25, 2780–2786

44 Cho, S.B. et al. (2009) Identifying set-wise differential co-expression in gene expression microarray data. BMC Bioinformatics 10, 109

45 Prieto, C. et al. (2006) Algorithm to find gene expression profiles of deregulation and identify families of disease-altered genes. Bioinformatics 22, 1103–1110

46 Ho, J.W. et al. (2008) Differential variability analysis of gene expression and its application to human diseases. Bioinformatics 24, i390–398

47 Shipley, B. (2002) Cause and Correlation in Biology: A User’s Guide to Path Analysis, Structural Equations and Causal Inference, Cambridge University Press

48 Keurentjes, J.J. et al. (2006) The genetics of plant metabolism. Nat. Genet. 38, 842–849

49 Fu, J. et al. (2009) System-wide molecular evidence for phenotypic buffering in Arabidopsis. Nat. Genet. 41, 166–167

50 Jansen, R.C. and Nap, J.P. (2001) Genetical genomics: the added value from segregation. Trends Genet. 17, 388–391

51 Jansen, R.C. (2003) Studying complex biological systems using multifactorial perturbation. Nat. Rev. Genet. 4, 145–151

52 Zhu, J. et al. (2004) An integrative genomics approach to the reconstruction of gene networks in segregating populations. Cytogenet. Genome Res. 105, 363–374

53 Bing, N. and Hoeschele, I. (2005) Genetical genomics analysis of a yeast segregant population for transcription network inference. Genetics 170, 533–542

54 Bystrykh, L. et al. (2005) Uncovering regulatory pathways that affect hematopoietic stem cell function using ‘genetical genomics’. Nat. Genet. 37, 225–232

55 Schadt, E.E. et al. (2005) An integrative genomics approach to infer causal associations between gene expression and disease. Nat. Genet. 37, 710–717

56 Lum, P.Y. et al. (2006) Elucidating the murine brain transcriptional network in a segregating mouse population to identify core functional modules for obesity and diabetes. J. Neurochem. 97 (Suppl. 1), 50–62

57 Kulp, D. and Jagalur, M. (2006) Causal inference of regulator-target pairs by gene mapping of expression phenotypes. BMC Genomics 7, 125

58 Liu, B. et al. (2008) Gene network inference via structural equation modeling in genetical genomics experiments. Genetics 178, 1763–1776

59 Aten, J.E. et al. (2008) Using genetic markers to orient the edges in quantitative trait networks: the NEO software. BMC Syst. Biol. 2, 34

60 Chaibub Neto, E. et al. (2008) Inferring causal phenotype networks from segregating populations. Genetics 179, 1089–1100

61 Rockman, M.V. (2008) Reverse engineering the genotype–phenotype map with natural genetic variation. Nature 456, 738–744

62 Liu, B. et al. (2009) Inferring Gene Regulatory Networks from Genetical Genomics Data. In Handbook of Research on Computational Methodologies in Gene Regulatory Networks (Das, S. et al., eds), pp. 79–107, IGI Global

63 Segal, E. et al. (2004) A module map showing conditional activity of expression modules in cancer. Nat. Genet. 36, 1090–1098

64 Segal, E. et al. (2005) From signatures to models: understanding cancer using microarrays. Nat. Genet. 37 (Suppl.), S38–45

65 Carro, M.S. et al. (2010) The transcriptional network for mesenchymal transformation of brain tumours. Nature 463, 318–325

66 Xu, M. et al. (2008) An integrative approach to characterize disease-specific pathways and their coordination: a case study in cancer. BMC Genomics 9 (Suppl. 1), S12

67 Stuart, J.M. et al. (2003) A gene-coexpression network for global discovery of conserved genetic modules. Science 302, 249–255

68 McCarroll, S.A. et al. (2004) Comparing genomic expression patterns across species identifies shared transcriptional profile in aging. Nat. Genet. 36, 197–204

69 Ihmels, J. et al. (2005) Comparative gene expression analysis by differential clustering approach: application to the Candida albicans transcription program. PLoS Genet. 1, e39

70 Barabasi, A.L. and Albert, R. (1999) Emergence of scaling in random networks. Science 286, 509–512

71 Watts, D.J. and Strogatz, S.H. (1998) Collective dynamics of ‘small-world’ networks. Nature 393, 440–442

72 Pieroni, E. et al. (2008) Protein networking: insights into global functional organization of proteomes. Proteomics 8, 799–816


73 Bhalla, U.S. (2003) Understanding complex signaling networks through models and metaphors. Prog. Biophys. Mol. Biol. 81, 45–65

74 Klamt, S. et al. (2009) Hypergraphs and cellular networks. PLoS Comput. Biol. 5, e1000385

75 de la Fuente, A. (2009) What are gene regulatory networks? In Handbook of Research on Computational Methodologies in Gene Regulatory Networks (Das, S. et al., eds), pp. 1–27, IGI Global

76 de la Fuente, A. et al. (2002) Linking the genes: inferring quantitative gene networks from microarray data. Trends Genet. 18, 395–398

77 Spirtes, P. et al. (1993) Causation, Prediction, and Search, MIT Press

78 Opgen-Rhein, R. and Strimmer, K. (2007) From correlation to causation networks: a simple approximate learning algorithm and its application to high-dimensional plant gene expression data. BMC Syst. Biol. 1, 37

79 Ploner, A. et al. (2005) Correlation test to assess low-level processing of high-density oligonucleotide microarray data. BMC Bioinformatics 6, 80

80 Lim, W.K. et al. (2007) Comparative analysis of microarray normalization procedures: effects on reverse engineering gene networks. Bioinformatics 23, i282–288
