introduction to network medicine

INTRODUCTION TO NETWORK MEDICINE Marc Santolini Center for Complex Network Research (CCNR)

Upload: marc-santolini

Post on 21-Jan-2018


TRANSCRIPT

Page 1: Introduction to Network Medicine

INTRODUCTION TO NETWORK MEDICINE
Marc Santolini
Center for Complex Network Research (CCNR)

Page 2: Introduction to Network Medicine

Reductionism, which has dominated biological research for over a century, has provided a wealth of knowledge about individual cellular components and their functions. Despite its enormous success, it is increasingly clear that a discrete biological function can only rarely be attributed to an individual molecule. Instead, most biological characteristics arise from complex interactions between the cell’s numerous constituents, such as proteins, DNA, RNA and small molecules (REFS 1–8). Therefore, a key challenge for biology in the twenty-first century is to understand the structure and the dynamics of the complex intercellular web of interactions that contribute to the structure and function of a living cell.

The development of high-throughput data-collection techniques, as epitomized by the widespread use of microarrays, allows for the simultaneous interrogation of the status of a cell’s components at any given time. In turn, new technology platforms, such as PROTEIN CHIPS or semi-automated YEAST TWO-HYBRID SCREENS, help to determine how and when these molecules interact with each other. Various types of interaction webs, or networks (including protein–protein interaction, metabolic, signalling and transcription-regulatory networks), emerge from the sum of these interactions. None of these networks is independent; instead, they form a ‘network of networks’ that is responsible for the behaviour of the cell. A major challenge of contemporary biology is to embark on an integrated theoretical and experimental programme to map out, understand and model in quantifiable terms the topological and dynamic properties of the various networks that control the behaviour of the cell.

Help along the way is provided by the rapidly developing theory of complex networks that, in the past few years, has made advances towards uncovering the organizing principles that govern the formation and evolution of various complex technological and social networks (REFS 9–12). This research is already making an impact on cell biology. It has led to the realization that the architectural features of molecular interaction networks within a cell are shared to a large degree by other complex systems, such as the Internet, computer chips and society. This unexpected universality indicates that similar laws may govern most complex networks in nature, which allows the expertise from large and well-mapped non-biological systems to be used to characterize the intricate interwoven relationships that govern cellular functions.

In this review, we show that the quantifiable tools of network theory offer unforeseen possibilities to understand the cell’s internal organization and evolution, fundamentally altering our view of cell biology. The emerging results are forcing the realization that, notwithstanding the importance of individual molecules, cellular function is a contextual attribute of strict and quantifiable patterns of interactions between the myriad of cellular constituents. Although uncovering the generic organizing principles of cellular networks

NETWORK BIOLOGY: UNDERSTANDING THE CELL’S FUNCTIONAL ORGANIZATION

Albert-László Barabási* & Zoltán N. Oltvai‡

A key aim of postgenomic biomedical research is to systematically catalogue all molecules and their interactions within a living cell. There is a clear need to understand how these molecules and the interactions between them determine the function of this enormously complex machinery, both in isolation and when surrounded by other cells. Rapid advances in network biology indicate that cellular networks are governed by universal laws and offer a new conceptual framework that could potentially revolutionize our view of biology and disease pathologies in the twenty-first century.

PROTEIN CHIPS

Similar to cDNA microarrays, this evolving technology involves arraying a genomic set of proteins on a solid surface without denaturing them. The proteins are arrayed at a high enough density for the detection of activity, binding to lipids and so on.

NATURE REVIEWS | GENETICS VOLUME 5 | FEBRUARY 2004 | 101

*Department of Physics, University of Notre Dame, Notre Dame, Indiana 46556, USA. ‡Department of Pathology, Northwestern University, Chicago, Illinois 60611, USA. e-mails: [email protected]; [email protected] doi:10.1038/nrg1272

REVIEWS

Barabasi et al., Nat Rev Genet 2004


Box 2 | Network models

Network models are crucial for shaping our understanding of complex networks and help to explain the origin of observed network characteristics. There are three models that had a direct impact on our understanding of biological networks.

Random networks

The Erdös–Rényi (ER) model of a random network (REF. 14; see figure, part A) starts with N nodes and connects each pair of nodes with probability p, which creates a graph with approximately pN(N–1)/2 randomly placed links (see figure, part Aa). The node degrees follow a Poisson distribution (see figure, part Ab), which indicates that most nodes have approximately the same number of links (close to the average degree <k>). The tail (high k region) of the degree distribution P(k) decreases exponentially, which indicates that nodes that significantly deviate from the average are extremely rare. The clustering coefficient is independent of a node’s degree, so C(k) appears as a horizontal line if plotted as a function of k (see figure, part Ac). The mean path length is proportional to the logarithm of the network size, $\ell \sim \log N$, which indicates that it is characterized by the small-world property.
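The ER construction described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names `erdos_renyi` and `mean_degree` and the values of N and p are assumptions chosen for the example, not part of the original box.

```python
import random
from itertools import combinations


def erdos_renyi(n, p, seed=None):
    """Connect each of the n(n-1)/2 node pairs independently with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i, j in combinations(range(n), 2) if rng.random() < p]


def mean_degree(n, edges):
    """Average degree <k>: every link contributes to the degree of two nodes."""
    return 2 * len(edges) / n


# Illustrative parameters (assumed, not taken from the source).
n, p = 500, 0.02
edges = erdos_renyi(n, p, seed=1)

expected_links = p * n * (n - 1) / 2   # the pN(N-1)/2 estimate from the text
expected_degree = p * (n - 1)          # average degree of the ER model

print("links:", len(edges), "expected about", expected_links)
print("<k>:", mean_degree(n, edges), "expected about", expected_degree)
```

With N = 500 and p = 0.02 the expected number of links is pN(N–1)/2 = 2,495 and the expected average degree is p(N–1) = 9.98; a sampled graph should land close to both values, with no node far above the average.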

Scale-free networks

Scale-free networks (see figure, part B) are characterized by a power-law degree distribution; the probability that a node has k links follows $P(k) \sim k^{-\gamma}$, where γ is the degree exponent. The probability that a node is highly connected is statistically more significant than in a random graph, the network’s properties often being determined by a relatively small number of highly connected nodes that are known as hubs (see figure, part Ba; blue nodes). In the Barabási–Albert model of a scale-free network (REF. 15), at each time point a node with M links is added to the network, which connects to an already existing node I with probability $\Pi_I = k_I / \sum_J k_J$, where $k_I$ is the degree of node I (FIG. 3) and J is the index denoting the sum over network nodes. The network that is generated by this growth process has a power-law degree distribution that is characterized by the degree exponent γ = 3. Such distributions are seen as a straight line on a log–log plot (see figure, part Bb). The network that is created by the Barabási–Albert model does not have an inherent modularity, so C(k) is independent of k (see figure, part Bc). Scale-free networks with degree exponents 2 < γ < 3, a range that is observed in most biological and non-biological networks, are ultra-small (REFS 34,35), with the average path length following $\ell \sim \log \log N$, which is significantly shorter than the log N that characterizes random small-world networks.
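The growth rule of the Barabási–Albert model can be sketched as follows. This is a minimal standard-library illustration, not the authors' code: the function names, the fully connected starting core of m + 1 nodes, and the requirement that a new node's m targets be distinct are assumptions made for the example.

```python
import random
from collections import Counter


def barabasi_albert(n, m, seed=None):
    """Grow a graph to n nodes; each new node attaches m links preferentially."""
    if not 1 <= m < n:
        raise ValueError("need 1 <= m < n")
    rng = random.Random(seed)
    # Starting core (an assumption): m + 1 nodes, all linked to each other.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Every node appears in `stubs` once per link end, so a uniform draw
    # from `stubs` selects node I with probability k_I / sum_J k_J.
    stubs = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:          # m distinct existing nodes
            targets.add(rng.choice(stubs))
        for target in targets:
            edges.append((new, target))
            stubs.extend((new, target))
    return edges


def degrees(edges):
    """Map each node to its degree."""
    return Counter(node for edge in edges for node in edge)


# Illustrative run (parameters assumed): list the five best-connected nodes.
deg = degrees(barabasi_albert(2000, 2, seed=42))
print(deg.most_common(5))
```

Because early nodes have more time to collect links, the best-connected nodes printed by the example tend to be the ones with the lowest indices, which is the 'rich-gets-richer' behaviour that produces hubs.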

Hierarchical networks

To account for the coexistence of modularity, local clustering and scale-free topology in many real systems it has to be assumed that clusters combine in an iterative manner, generating a hierarchical network (REFS 47,53; see figure, part C). The starting point of this construction is a small cluster of four densely linked nodes (see the four central nodes in figure, part Ca). Next, three replicas of this module are generated and the three external nodes of the replicated clusters connected to the central node of the old cluster, which produces a large 16-node module. Three replicas of this 16-node module are then generated and the 16 peripheral nodes connected to the central node of the old module, which produces a new module of 64 nodes. The hierarchical network model seamlessly integrates a scale-free topology with an inherent modular structure by generating a network that has a power-law degree distribution with degree exponent $\gamma = 1 + \ln 4/\ln 3 = 2.26$ (see figure, part Cb) and a large, system-size independent average clustering coefficient <C> ~ 0.6. The most important signature of hierarchical modularity is the scaling of the clustering coefficient, which follows $C(k) \sim k^{-1}$, a straight line of slope –1 on a log–log plot (see figure, part Cc). A hierarchical architecture implies that sparsely connected nodes are part of highly clustered areas, with communication between the different highly clustered neighbourhoods being maintained by a few hubs (see figure, part Ca).
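To make the C(k) signature concrete, the sketch below computes the clustering coefficient of each node and averages it by degree. It assumes the standard definition, in which a node with k neighbours and n links among those neighbours has C = 2n / (k(k–1)); the function names and the four-node toy graph are illustrative assumptions. The last lines check the degree exponent quoted in the box.

```python
import math
from collections import defaultdict
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # no neighbour pairs, so no triangles are possible
    linked = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 2 * linked / (k * (k - 1))


def clustering_by_degree(adj):
    """Average clustering coefficient C(k) for each degree k in the graph."""
    by_k = defaultdict(list)
    for node in adj:
        by_k[len(adj[node])].append(local_clustering(adj, node))
    return {k: sum(cs) / len(cs) for k, cs in sorted(by_k.items())}


# Toy undirected graph (assumed): a triangle 0-1-2 with node 3 hanging off node 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(clustering_by_degree(adj))

# Degree exponent of the hierarchical model: 1 + ln 4 / ln 3.
gamma = 1 + math.log(4) / math.log(3)
print(round(gamma, 2))
```

Plotting the values returned by `clustering_by_degree` against k on log–log axes is one way to look for the slope of –1 that the box describes; a flat line would instead point to the random or Barabási–Albert case.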

[Figure, Box 2. Panels: A, random network; B, scale-free network; C, hierarchical network. For each model, part a shows the network, part b the degree distribution P(k) plotted against k, and part c the clustering coefficient C(k) plotted against k.]

SCALE-FREE NETWORKS

Page 3: Introduction to Network Medicine


Depending on the nature of the interactions, networks can be directed or undirected. In directed networks, the interaction between any two nodes has a well-defined direction, which represents, for example, the direction of material flow from a substrate to a product in a metabolic reaction, or the direction of information flow from a transcription factor to the gene that it regulates. In undirected networks, the links do not have an assigned direction. For example, in protein interaction networks (FIG. 2) a link represents a mutual binding relationship: if protein A binds to protein B, then protein B also binds to protein A.

Architectural features of cellular networks

From random to scale-free networks. Probably the most important discovery of network theory was the realization that despite the remarkable diversity of networks in nature, their architecture is governed by a few simple principles that are common to most networks of major scientific and technological interest (REFS 9,10). For decades graph theory (the field of mathematics that deals with the mathematical foundations of networks) modelled complex networks either as regular objects, such as a square or a diamond lattice, or as completely random networks (REF. 13). This approach was rooted in the influential work of two mathematicians, Paul Erdös and Alfréd Rényi, who in 1960 initiated the study of the mathematical properties of random networks (REF. 14). Their much-investigated random network model assumes that a fixed number of nodes are connected randomly to each other (BOX 2). The most remarkable property of the model is its ‘democratic’ or uniform character, characterizing the degree, or connectivity (k; BOX 1), of the individual nodes. Because, in the model, the links are placed randomly among the nodes, it is expected that some nodes collect only a few links whereas others collect many more. In a random network, the node degrees follow a Poisson distribution, which indicates that most nodes have roughly the same number of links, approximately equal to the network’s average degree, <k> (where <> denotes the average); nodes that have significantly more or fewer links than <k> are absent or very rare (BOX 2).

Despite its elegance, a series of recent findings indicate that the random network model cannot explain the topological properties of real networks. The deviations from the random model have several key signatures, the most striking being the finding that, in contrast to the Poisson degree distribution, for many social and technological networks the number of nodes with a given degree follows a power law. That is, the probability that a chosen node has exactly k links follows $P(k) \sim k^{-\gamma}$, where γ is the degree exponent, with its value for most networks being between 2 and 3 (REF. 15). Networks that are characterized by a power-law degree distribution are highly non-uniform: most of the nodes have only a few links. A few nodes with a very large number of links, which are often called hubs, hold these nodes together. Networks with a power-law degree distribution are called scale-free (REF. 15), a name that is rooted in statistical physics literature. It indicates the absence of a typical node in the network (one that could be used to characterize the rest of the nodes). This is in strong contrast to random networks, for which the degree of all nodes is in the vicinity of the average degree, which could be considered typical. However, scale-free networks could easily be called scale-rich as well, as their main feature is the coexistence of nodes of widely different degrees (scales), from nodes with one or two links to major hubs.

Cellular networks are scale-free. An important development in our understanding of the cellular network architecture was the finding that most networks within the cell approximate a scale-free topology. The first evidence came from the analysis of metabolism, in which the nodes are metabolites and the links represent enzyme-catalysed biochemical reactions (FIG. 1). As many of the reactions are irreversible, metabolic networks are directed. So, for each metabolite an ‘in’ and an ‘out’ degree (BOX 1) can be assigned that denotes the number of reactions that produce or consume it, respectively. The analysis of the metabolic networks of 43 different organisms from all three domains of life (eukaryotes, bacteria and archaea) indicates that the cellular metabolism has a scale-free topology, in which most metabolic substrates participate in only one or two reactions, but a few, such as pyruvate or coenzyme A, participate in dozens and function as metabolic hubs (REFS 16,17).

Figure 2 | Yeast protein interaction network. A map of protein–protein interactions (REF. 18) in Saccharomyces cerevisiae, which is based on early yeast two-hybrid measurements (REF. 23), illustrates that a few highly connected nodes (which are also known as hubs) hold the network together. The largest cluster, which contains ~78% of all proteins, is shown. The colour of a node indicates the phenotypic effect of removing the corresponding protein (red = lethal, green = non-lethal, orange = slow growth, yellow = unknown). Reproduced with permission from REF. 18 © Macmillan Magazines Ltd.

Jeong et al., “Lethality and centrality in protein networks”, Nature 2001

THE YEAST INTERACTOME

Page 4: Introduction to Network Medicine

FABRICATING HUBS


major engineer of the genomic landscape, it is likely to be a key mechanism for generating the scale-free topology.

Two further results offer direct evidence that network growth is responsible for the observed topological features. The scale-free model (BOX 2) predicts that the nodes that appeared early in the history of the network are the most connected ones (REF. 15). Indeed, an inspection of the metabolic hubs indicates that the remnants of the RNA world, such as coenzyme A, NAD and GTP, are among the most connected substrates of the metabolic network, as are elements of some of the most ancient metabolic pathways, such as glycolysis and the tricarboxylic acid cycle (REF. 17). In the context of the protein interaction networks, cross-genome comparisons have found that, on average, the evolutionarily older proteins have more links to other proteins than their younger counterparts (REFS 45,46). This offers direct empirical evidence for preferential attachment.

Motifs, modules and hierarchical networks

Cellular functions are likely to be carried out in a highly modular manner (REF. 1). In general, modularity refers to a group of physically or functionally linked molecules (nodes) that work together to achieve a (relatively) distinct function (REFS 1,6,8,47). Modules are seen in many systems, for example, circles of friends in social networks or websites that are devoted to similar topics on the World Wide Web. Similarly, in many complex engineered systems, from a modern aircraft to a computer chip, a highly modular structure is a fundamental design attribute.

Biology is full of examples of modularity. Relatively invariant protein–protein and protein–RNA complexes (physical modules) are at the core of many basic biological functions, from nucleic-acid synthesis to protein degradation (REF. 48). Similarly, temporally coregulated groups of molecules are known to govern various stages of the cell cycle (REFS 49–51), or to convey extracellular signals in bacterial chemotaxis or the yeast pheromone response pathway. In fact, most molecules in a cell are either part of an intracellular complex with modular activity, such as the ribosome, or they participate in an extended (functional) module as a temporally regulated element of a relatively distinct process (for example, signal amplification in a signalling pathway; REF. 52).

To address the modularity of networks, tools and measures need to be developed that will allow us not only to establish if a network is modular, but also to explicitly identify the modules and their relationships in a given network.

High clustering in cellular networks. In a network representation, a module (or cluster) appears as a highly interconnected group of nodes. Each module can be reduced to a set of triangles (BOX 1); a high density of triangles is reflected by the clustering coefficient, C (REF. 33), the signature of a network’s potential modularity (BOX 1). In the absence of modularity, the clustering coefficients of the real and the randomized networks are comparable. The average clustering coefficient, <C>, of

[Figure 3, parts a and b. Part a: network with nodes 1 and 2 labelled. Part b: proteins and genes, shown before duplication and after duplication.]

Figure 3 | The origin of the scale-free topology and hubs in biological networks. The origin of the scale-free topology in complex networks can be reduced to two basic mechanisms: growth and preferential attachment. Growth means that the network emerges through the subsequent addition of new nodes, such as the new red node that is added to the network that is shown in part a. Preferential attachment means that new nodes prefer to link to more connected nodes. For example, the probability that the red node will connect to node 1 is twice as large as connecting to node 2, as the degree of node 1 ($k_1 = 4$) is twice the degree of node 2 ($k_2 = 2$). Growth and preferential attachment generate hubs through a ‘rich-gets-richer’ mechanism: the more connected a node is, the more likely it is that new nodes will link to it, which allows the highly connected nodes to acquire new links faster than their less connected peers. In protein interaction networks, scale-free topology seems to have its origin in gene duplication. Part b shows a small protein interaction network (blue) and the genes that encode the proteins (green). When cells divide, occasionally one or several genes are copied twice into the offspring’s genome (illustrated by the green and red circles). This induces growth in the protein interaction network because now we have an extra gene that encodes a new protein (red circle). The new protein has the same structure as the old one, so they both interact with the same proteins. Ultimately, the proteins that interacted with the original duplicated protein will each gain a new interaction to the new protein. Therefore proteins with a large number of interactions tend to gain links more often, as it is more likely that they interact with the protein that has been duplicated. This is a mechanism that generates preferential attachment in cellular networks. Indeed, in the example that is shown in part b it does not matter which gene is duplicated: the most connected central protein (hub) gains one interaction. In contrast, the square, which has only one link, gains a new link only if the hub is duplicated.
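The duplication argument in the caption can be checked by brute force on a toy network. The sketch below is an illustration under stated assumptions: every gene is equally likely to be duplicated, the copy inherits all of the original protein's interactions, and there are no self-interactions. The function names and the star-shaped example are hypothetical.

```python
from fractions import Fraction


def duplicate(adj, gene, copy):
    """Return a new network in which `copy` inherits every interaction of `gene`."""
    new = {node: set(partners) for node, partners in adj.items()}
    new[copy] = set(adj[gene])
    for partner in adj[gene]:
        new[partner].add(copy)
    return new


def gain_probability(adj):
    """Chance that each protein gains a link when one gene is duplicated at random."""
    copy = object()  # placeholder label that cannot clash with an existing node
    gains = {node: 0 for node in adj}
    for gene in adj:  # try every possible duplication once
        after = duplicate(adj, gene, copy)
        for node in adj:
            if len(after[node]) > len(adj[node]):
                gains[node] += 1
    return {node: Fraction(count, len(adj)) for node, count in gains.items()}


# Toy network (assumed): one hub bound to four single-link proteins.
star = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub"}, "b": {"hub"}, "c": {"hub"}, "d": {"hub"},
}
for protein, prob in gain_probability(star).items():
    print(protein, prob)
```

In this toy case each protein's chance of gaining a link equals its degree divided by the number of genes, so the hub (degree 4) is four times as likely to gain an interaction as any single-link protein. That proportionality to degree is what the caption calls preferential attachment.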

Page 5: Introduction to Network Medicine

NETWORK MOTIFS


Superfamilies of Evolved and Designed Networks

Ron Milo, Shalev Itzkovitz, Nadav Kashtan, Reuven Levitt, Shai Shen-Orr, Inbal Ayzenshtat, Michal Sheffer, Uri Alon*

Complex biological, technological, and sociological networks can be of very different sizes and connectivities, making it difficult to compare their structures. Here we present an approach to systematically study similarity in the local structure of networks, based on the significance profile (SP) of small subgraphs in the network compared to randomized networks. We find several superfamilies of previously unrelated networks with very similar SPs. One superfamily, including transcription networks of microorganisms, represents “rate-limited” information-processing networks strongly constrained by the response time of their components. A distinct superfamily includes protein signaling, developmental genetic networks, and neuronal wiring. Additional superfamilies include power grids, protein-structure networks and geometric networks, World Wide Web links and social networks, and word-adjacency networks from different languages.

Many networks in nature share global properties (1, 2). Their degree sequences (the number of edges per node) often follow a long-tailed distribution, in which some nodes are much more connected than the average (3). In addition, natural networks often show the small-world property of short paths between nodes and highly clustered connections (1, 2, 4). Despite these global similarities, networks from different fields can have very different local structure (5). It was recently found that networks display certain patterns, termed “network motifs,” at much higher frequency than expected in randomized networks (6, 7). In biological networks, these motifs were suggested to be recurring circuit elements that carry out key information-processing tasks (6, 8–10).
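The comparison with randomized networks can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it counts only two three-node subgraphs (the feed-forward loop and the three-node cycle), randomizes by swapping the targets of edge pairs so that every node keeps its in- and out-degree, and normalizes the Z-scores to unit length as one common way of forming a significance profile. All function names, the example network and the numbers of swaps and samples are assumptions.

```python
import math
import random
from statistics import mean, pstdev


def successors(edges):
    """Map each source node to the set of nodes it points to."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
    return succ


def count_ffl(edges):
    """Feed-forward loops: a->b, b->c and a->c (no self-loops assumed)."""
    succ = successors(edges)
    return sum(1 for a, b in edges for c in succ.get(b, ()) if c in succ[a])


def count_cycles(edges):
    """Three-node cycles a->b->c->a; each cycle is seen once per edge."""
    succ = successors(edges)
    hits = sum(1 for a, b in edges for c in succ.get(b, ()) if a in succ.get(c, ()))
    return hits // 3


def randomize(edges, swaps, rng):
    """Degree-preserving shuffle: repeatedly swap the targets of two edges."""
    edges = list(edges)
    present = set(edges)
    for _ in range(swaps):
        i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
        (a, b), (c, d) = edges[i], edges[j]
        new1, new2 = (a, d), (c, b)
        if a == d or c == b or new1 in present or new2 in present:
            continue  # would create a self-loop or a duplicate edge
        present -= {(a, b), (c, d)}
        present |= {new1, new2}
        edges[i], edges[j] = new1, new2
    return edges


def z_score(real, randomized):
    """Standardized excess of the real count over the randomized ensemble."""
    sd = pstdev(randomized)
    return (real - mean(randomized)) / sd if sd > 0 else 0.0


def significance_profile(z_scores):
    """Normalize the vector of Z-scores to unit length."""
    norm = math.sqrt(sum(z * z for z in z_scores))
    return [z / norm if norm else 0.0 for z in z_scores]


# Small directed example network (assumed for illustration).
network = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4), (4, 5), (5, 0)]
rng = random.Random(0)
ensemble = [randomize(network, 100, rng) for _ in range(200)]
z = [
    z_score(count_ffl(network), [count_ffl(g) for g in ensemble]),
    z_score(count_cycles(network), [count_cycles(g) for g in ensemble]),
]
print(significance_profile(z))
```

Normalizing the Z-scores makes profiles comparable across networks of very different sizes, which is the difficulty the abstract raises; a full analysis would cover every three-node (or larger) subgraph type rather than the two counted here.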

Departments of Molecular Cell Biology, Physics of Complex Systems, and Computer Science, Weizmann Institute of Science, Rehovot 76100, Israel.

*To whom correspondence should be addressed at Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot 76100, Israel. E-mail: [email protected]

REPORTS

5 MARCH 2004 VOL 303 SCIENCE www.sciencemag.org 1538

What led to the pervasiveness of hybridsbetween Cx. pipiens and Cx. molestus inNorth America, but not in Europe and Af-rica, still remains to be determined. Insouthernmost Europe, we identified twopopulations with a few hybrid individuals,as well as populations with pure Cx. pipienssignatures and populations with a mix ofpure Cx. pipiens and pure Cx. molestussignatures (Fig. 2). Indeed, previousallozyme- based studies indicated the exis-tence of populations in Italy with a mix ofthe two forms (26 ) but a very low rate ofhybridization (1%), probably because oftheir different mating behaviors (26 ). Therarity of southern European hybrids and ourfailure to find hybrids in northern Europemay be due to their low fitness and inabilityto diapause. Importantly, the introduction to theUnited States of separate populations of Cx. pipi-ens and Cx. molestus that later hybridized, or ofhybrids from southern Europe, has led to abun-dant and ubiquitous hybrid forms that survive therigors of northern winters.

It is now clear that models derived fromthe U.S. epidemic of WNV (28) may not beapplicable to Eurasia, and vice versa (29).A major factor in all recent outbreaks (Ro-mania 1996, Russia 1999, and UnitedStates 1999) is the involvement of mosqui-toes in the Cx. pipiens complex as theprimary vectors (8, 30). Unlike EuropeanCx. pipiens, U.S. Cx. pipiens appears to bitereadily both avian hosts and humans (2,31). Here we have shown that, across thenortheastern United States, a large propor-tion of individuals are hybrids of human-biter and bird-biter forms. In combinationwith susceptible migrating birds and highlyconcentrated human populations in U.S.cities and suburbs, the prevalence of suchbridge vectors that readily transmit the vi-rus among and between avian hosts andhumans could have created the current ep-idemic conditions.

The present study suggests that changes invectorial capacity and the creation of newefficient vectors may occur with new intro-ductions. In particular, the arrival of hybridAmerican forms in northern Europe has thepotential to radically change the dynamics ofWNV in Europe.


Superfamilies of Evolved and Designed Networks

Ron Milo, Shalev Itzkovitz, Nadav Kashtan, Reuven Levitt, Shai Shen-Orr, Inbal Ayzenshtat, Michal Sheffer, Uri Alon*

Complex biological, technological, and sociological networks can be of very different sizes and connectivities, making it difficult to compare their structures. Here we present an approach to systematically study similarity in the local structure of networks, based on the significance profile (SP) of small subgraphs in the network compared to randomized networks. We find several superfamilies of previously unrelated networks with very similar SPs. One superfamily, including transcription networks of microorganisms, represents “rate-limited” information-processing networks strongly constrained by the response time of their components. A distinct superfamily includes protein signaling, developmental genetic networks, and neuronal wiring. Additional superfamilies include power grids, protein-structure networks and geometric networks, World Wide Web links and social networks, and word-adjacency networks from different languages.

Many networks in nature share global properties (1, 2). Their degree sequences (the number of edges per node) often follow a long-tailed distribution, in which some nodes are much more connected than the average (3). In addition, natural networks often show the small-world property of short paths between nodes and highly clustered connections (1, 2, 4). Despite these global similarities, networks from different fields can have very different local structure (5). It was recently found that networks display certain patterns, termed “network motifs,” at much higher frequency than expected in randomized networks (6, 7). In biological networks, these motifs were suggested to be recurring circuit elements that carry out key information-processing tasks (6, 8–10).

Departments of Molecular Cell Biology, Physics of Complex Systems, and Computer Science, Weizmann Institute of Science, Rehovot 76100, Israel.

*To whom correspondence should be addressed at Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot 76100, Israel. E-mail: [email protected]

R E P O R T S

5 MARCH 2004 VOL 303 SCIENCE www.sciencemag.org 1538

To understand the design principles of complex networks, it is important to compare the local structure of networks from different fields. The main difficulty is that these networks can be of vastly different sizes [for example, World Wide Web (WWW) hyperlink networks with millions of nodes and social networks with tens of nodes] and degree sequences. Here, we present an approach for comparing network local structure, based on the significance profile (SP). To calculate the SP of a network, the network is compared to an ensemble of randomized networks with the same degree sequence. The comparison to randomized networks compensates for effects due to network size and degree sequence. For each subgraph i, the statistical significance is described by the Z score (11):

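$$Z_i = \frac{N^{\mathrm{real}}_i - \langle N^{\mathrm{rand}}_i \rangle}{\mathrm{std}(N^{\mathrm{rand}}_i)}$$

where $N^{\mathrm{real}}_i$ is the number of times the subgraph appears in the network, and $\langle N^{\mathrm{rand}}_i \rangle$ and $\mathrm{std}(N^{\mathrm{rand}}_i)$ are the mean and standard deviation of its appearances in the randomized network ensemble. The SP is the vector of Z scores normalized to length 1:

$$\mathrm{SP}_i = \frac{Z_i}{\left(\sum Z_i^2\right)^{1/2}}$$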

The normalization emphasizes the relative significance of subgraphs, rather than the absolute significance. This is important for comparison of networks of different sizes, because motifs (subgraphs that occur much more often than expected at random) in large networks tend to display higher Z scores than motifs in small networks (7).
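The two definitions above can be illustrated with a minimal Python sketch. It assumes the subgraph counts for the real network and for each randomized network are already available; the function names and the toy counts are hypothetical, and the use of the population standard deviation over the ensemble is an assumption, since the text does not say which estimator is used.

```python
import math
import statistics


def z_scores(real_counts, random_counts):
    """Z score per subgraph: (Nreal - mean(Nrand)) / std(Nrand).

    real_counts[i] is the count of subgraph i in the real network;
    random_counts[i] is the list of its counts across the randomized ensemble.
    """
    scores = []
    for n_real, samples in zip(real_counts, random_counts):
        mean = statistics.mean(samples)
        std = statistics.pstdev(samples)  # assumption: population std over the ensemble
        # Guard against a degenerate ensemble with zero variance.
        scores.append((n_real - mean) / std if std > 0 else 0.0)
    return scores


def significance_profile(z):
    """Normalize the vector of Z scores to unit length."""
    norm = math.sqrt(sum(v * v for v in z))
    return [v / norm for v in z] if norm > 0 else [0.0] * len(z)


# Hypothetical counts for three subgraph types and a four-member ensemble.
real = [40, 3, 13]
rand = [[10, 12, 8, 10], [6, 4, 5, 5], [12, 11, 13, 12]]

z = z_scores(real, rand)
sp = significance_profile(z)

# Scaling every Z score by the same factor leaves the profile unchanged,
# which is why the SP can be compared across networks of different sizes.
sp_scaled = significance_profile([3.0 * v for v in z])
```

Because only the direction of the Z-score vector survives normalization, a large network whose Z scores are uniformly inflated yields the same SP as a small network with the same relative pattern.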

We present in Fig. 1 the SP of the 13 possible directed connected triads (triad significance profile, TSP) for networks from different fields (12). The TSP of these networks is almost always insensitive to removal of 30% of the edges or to addition of 50% new edges at random, demonstrating that it is robust to missing data or random data errors (SOM Text). Several superfamilies of networks with similar TSPs emerge from this analysis. One superfamily includes sensory transcription networks that control gene expression in bacteria and yeast in response to external stimuli. In these transcription networks, the nodes represent genes or operons and the edges represent direct transcriptional regulation (6, 13–15). Networks from three microorganisms, the bacteria Escherichia coli (6) and Bacillus subtilis (14) and the yeast Saccharomyces cerevisiae (7, 15), were analyzed. The networks have very similar TSPs (correlation coefficient c > 0.99). They show one strong motif, triad 7, termed “feedforward loop.” The feedforward loop has been theoretically and experimentally shown

Fig. 1. The triad significance profile (TSP) of networks from various disciplines. The TSP shows the normalized significance level (Z score) for each of the 13 triads. Networks with similar characteristic profiles are grouped into superfamilies. The lines connecting the significance values serve as guides to the eye. The networks are as follows (where N and E are the number of nodes and edges, respectively) (12): (i) Direct transcription interactions in the bacteria E. coli (6) (TRANSC-E.COLI N = 424, E = 519) and B. subtilis (14) (TRANSC-B.SUBTILIS N = 516, E = 577) and in the yeast S. cerevisiae [TRANSC-YEAST N = 685, E = 1052 (7) and TRANSC-YEAST-2 N = 2341, E = 3969 (15)]. (ii) Signal-transduction interactions in mammalian cells based on the signal transduction knowledge environment (STKE, http://stke.sciencemag.org/) (SIGNAL-TRANSDUCTION N = 491, E = 989), transcription networks that guide development in fruit fly (from the GeNet literature database, www.csa.ru/Inst/gorb_dep/inbios/genet/genet.htm) (TRANSC-DROSOPHILA N = 110, E = 307), endomesoderm development in sea urchin (20) (TRANSC-SEA-URCHIN N = 45, E = 83), and synaptic connections between neurons in C. elegans (NEURONS N = 280, E = 2170). (iii) WWW hyperlinks between Web pages in the www.nd.edu site (3) (WWW-1 N = 325729, E = 1469678), pages related to literary studies of Shakespeare (21) (WWW-2 N = 277114, E = 927400), and pages related to tango, specifically the music of Piazzolla (21) (WWW-3 N = 47870, E = 235441); and social networks, including inmates in prison (SOCIAL-1 N = 67, E = 182), sociology freshmen (22) (SOCIAL-2 N = 28, E = 110), and college students in a course about leadership (SOCIAL-3 N = 32, E = 96). (iv) Word-adjacency networks of a text in English (ENGLISH N = 7724, E = 46281), French (FRENCH N = 9424, E = 24295), Spanish (SPANISH N = 12642, E = 45129), and Japanese (JAPANESE N = 3177, E = 8300) and a bipartite model with two groups of nodes of sizes N1 = 1000 and N2 = 10 with probability of a directed or mutual edge between nodes of different groups being p = 0.06 and q = 0.003, respectively, and no edges between nodes within the same group (BIPARTITE N = 1010, E = 1261).
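The figure of 13 triads can be checked by brute force. The sketch below, a minimal illustration rather than the authors' procedure, enumerates every directed graph on three labeled nodes, keeps the weakly connected ones, and groups them by isomorphism; the helper names are hypothetical, and the resulting classes are not guaranteed to follow the triad numbering used in the figure.

```python
from itertools import permutations, product

NODES = (0, 1, 2)
# The six possible directed edges among three nodes (no self-loops).
POSSIBLE_EDGES = [(a, b) for a in NODES for b in NODES if a != b]


def is_weakly_connected(edges):
    """True if all three nodes are reachable when edge direction is ignored."""
    neighbors = {n: set() for n in NODES}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, stack = {NODES[0]}, [NODES[0]]
    while stack:
        for nb in neighbors[stack.pop()]:
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return len(seen) == len(NODES)


def canonical_form(edges):
    """Smallest relabeled edge list over all node permutations.

    Two triads are isomorphic exactly when their canonical forms are equal.
    """
    return min(
        tuple(sorted((perm[a], perm[b]) for a, b in edges))
        for perm in permutations(NODES)
    )


def connected_triads():
    """Set of isomorphism classes of weakly connected directed triads."""
    classes = set()
    for mask in product((0, 1), repeat=len(POSSIBLE_EDGES)):
        edges = [e for e, keep in zip(POSSIBLE_EDGES, mask) if keep]
        if is_weakly_connected(edges):
            classes.add(canonical_form(edges))
    return classes


triads = connected_triads()
print(len(triads))  # 13

# The feedforward loop (X -> Y, Y -> Z, X -> Z) and the three-node cycle
# are distinct classes among the 13.
feedforward_loop = canonical_form([(0, 1), (1, 2), (0, 2)])
three_cycle = canonical_form([(0, 1), (1, 2), (2, 0)])
```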


Page 6: Introduction to Network Medicine

The human disease network

Kwang-Il Goh*†‡§, Michael E. Cusick†‡¶, David Valle∥, Barton Childs∥, Marc Vidal†‡¶**, and Albert-Laszlo Barabasi*†‡**

*Center for Complex Network Research and Department of Physics, University of Notre Dame, Notre Dame, IN 46556; †Center for Cancer Systems Biology (CCSB) and ¶Department of Cancer Biology, Dana–Farber Cancer Institute, 44 Binney Street, Boston, MA 02115; ‡Department of Genetics, Harvard Medical School, 77 Avenue Louis Pasteur, Boston, MA 02115; §Department of Physics, Korea University, Seoul 136-713, Korea; and ∥Department of Pediatrics and the McKusick–Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205

Edited by H. Eugene Stanley, Boston University, Boston, MA, and approved April 3, 2007 (received for review February 14, 2007)

A network of disorders and disease genes linked by known disorder–gene associations offers a platform to explore in a single graph-theoretic framework all known phenotype and disease gene associations, indicating the common genetic origin of many diseases. Genes associated with similar disorders show both higher likelihood of physical interactions between their products and higher expression profiling similarity for their transcripts, supporting the existence of distinct disease-specific functional modules. We find that essential human genes are likely to encode hub proteins and are expressed widely in most tissues. This suggests that disease genes also would play a central role in the human interactome. In contrast, we find that the vast majority of disease genes are nonessential and show no tendency to encode hub proteins, and their expression pattern indicates that they are localized in the functional periphery of the network. A selection-based model explains the observed difference between essential and disease genes and also suggests that diseases caused by somatic mutations should not be peripheral, a prediction we confirm for cancer genes.

biological networks | complex networks | human genetics | systems biology | diseasome

Decades-long efforts to map human disease loci, at first genetically and later physically (1), followed by recent positional cloning of many disease genes (2) and genome-wide association studies (3), have generated an impressive list of disorder–gene association pairs (4, 5). In addition, recent efforts to map the protein–protein interactions in humans (6, 7), together with efforts to curate an extensive map of human metabolism (8) and regulatory networks, offer increasingly detailed maps of the relationships between different disease genes. Most of the successful studies building on these new approaches have focused, however, on a single disease, using network-based tools to gain a better understanding of the relationship between the genes implicated in a selected disorder (9).

Here we take a conceptually different approach, exploring whether human genetic disorders and the corresponding disease genes might be related to each other at a higher level of cellular and organismal organization. Support for the validity of this approach is provided by examples of genetic disorders that arise from mutations in more than a single gene (locus heterogeneity). For example, Zellweger syndrome is caused by mutations in any of at least 11 genes, all associated with peroxisome biogenesis (10). Similarly, there are many examples of different mutations in the same gene (allelic heterogeneity) giving rise to phenotypes currently classified as different disorders. For example, mutations in TP53 have been linked to 11 clinically distinguishable cancer-related disorders (11). Given the highly interlinked internal organization of the cell (12–17), it should be possible to improve the single gene–single disorder approach by developing a conceptual framework to link systematically all genetic disorders (the human “disease phenome”) with the complete list of disease genes (the “disease genome”), resulting in a global view of the “diseasome,” the combined set of all known disorder/disease gene associations.

Results

Construction of the Diseasome. We constructed a bipartite graph consisting of two disjoint sets of nodes. One set corresponds to all known genetic disorders, whereas the other set corresponds to all known disease genes in the human genome (Fig. 1). A disorder and a gene are then connected by a link if mutations in that gene are implicated in that disorder. The list of disorders, disease genes, and associations between them was obtained from the Online Mendelian Inheritance in Man (OMIM; ref. 18), a compendium of human disease genes and phenotypes. As of December 2005, this list contained 1,284 disorders and 1,777 disease genes. OMIM initially focused on monogenic disorders but in recent years has expanded to include complex traits and the associated genetic mutations that confer susceptibility to these common disorders (18). Although this history introduces some biases, and the disease gene record is far from complete, OMIM represents the most complete and up-to-date repository of all known disease genes and the disorders they confer. We manually classified each disorder into one of 22 disorder classes based on the physiological system affected [see supporting information (SI) Text, SI Fig. 5, and SI Table 1 for details].

Starting from the diseasome bipartite graph we generated two biologically relevant network projections (Fig. 1). In the “human disease network” (HDN) nodes represent disorders, and two disorders are connected to each other if they share at least one gene in which mutations are associated with both disorders (Figs. 1 and 2a). In the “disease gene network” (DGN) nodes represent disease genes, and two genes are connected if they are associated with the same disorder (Figs. 1 and 2b). Next, we discuss the potential of these networks to help us understand and represent in a single framework all known disease gene and phenotype associations.
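The two projections can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the disorder and gene names are hypothetical placeholders, not OMIM records, and the function names are invented for the example. Link weights count shared neighbors, matching the link widths described for Fig. 1.

```python
from collections import defaultdict
from itertools import combinations


def project(bipartite):
    """Weighted one-mode projection of a bipartite graph.

    bipartite maps each node of one set to the set of its neighbors in the
    other set. Two nodes are linked if they share at least one neighbor;
    the link weight is the number of shared neighbors.
    """
    weights = {}
    for a, b in combinations(sorted(bipartite), 2):
        shared = bipartite[a] & bipartite[b]
        if shared:
            weights[(a, b)] = len(shared)
    return weights


def invert(bipartite):
    """Swap the two node sets: gene -> set of disorders from disorder -> genes."""
    inverted = defaultdict(set)
    for left, rights in bipartite.items():
        for right in rights:
            inverted[right].add(left)
    return dict(inverted)


# Hypothetical toy diseasome (placeholder names only).
disorder_to_genes = {
    "disorder_A": {"gene_1", "gene_2", "gene_3"},
    "disorder_B": {"gene_2", "gene_3"},
    "disorder_C": {"gene_4"},
}

hdn = project(disorder_to_genes)          # disorders linked by shared genes
dgn = project(invert(disorder_to_genes))  # genes linked by shared disorders
```

In this toy case disorder_A and disorder_B share two genes, so the HDN has a single link of weight two, while disorder_C and gene_4 remain isolated in their respective projections.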

Properties of the HDN. If each human disorder tends to have a distinct and unique genetic origin, then the HDN would be disconnected into many single nodes corresponding to specific disorders or grouped into small clusters of a few closely related disorders. In contrast, the obtained HDN displays many connections between both individual disorders and disorder classes (Fig. 2a). Of 1,284 disorders, 867 have at least one link to other disorders, and 516 disorders form a giant component, suggesting that the genetic origins of most diseases, to some extent, are shared with other diseases. The number of genes associated with a disorder, s, has a broad distribution (see SI Fig. 6a), indicating that most disorders relate to a few disease genes, whereas a handful of phenotypes, such as deafness (s = 41), leukemia (s = 37), and colon cancer (s = 34), relate to dozens of genes (Fig. 2a). The degree (k) distribution of HDN (SI Fig. 6b) indicates that most disorders are linked to only

Author contributions: D.V., B.C., M.V., and A.-L.B. designed research; K.-I.G. and M.E.C. performed research; K.-I.G. and M.E.C. analyzed data; and K.-I.G., M.E.C., D.V., M.V., and A.-L.B. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Abbreviations: DGN, disease gene network; HDN, human disease network; GO, Gene Ontology; OMIM, Online Mendelian Inheritance in Man; PCC, Pearson correlation coefficient.

**To whom correspondence may be addressed. E-mail: [email protected] or [email protected].

This article contains supporting information online at www.pnas.org/cgi/content/full/0701361104/DC1.

© 2007 by The National Academy of Sciences of the USA

www.pnas.org/cgi/doi/10.1073/pnas.0701361104 PNAS | May 22, 2007 | vol. 104 | no. 21 | 8685–8690

APPLIED PHYSICAL SCIENCES

a few other disorders, whereas a few phenotypes such as colon cancer (linked to k = 50 other disorders) or breast cancer (k = 30) represent hubs that are connected to a large number of distinct disorders. The prominence of cancer among the most connected disorders arises in part from the many clinically distinct cancer subtypes tightly connected with each other through common tumor suppressor genes such as TP53 and PTEN.
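The quantities used in this discussion, the degree k of each node and the size of the giant component, can be computed from an edge list with a short traversal. The sketch below is illustrative only: the node names and edges are hypothetical, and the function names are assumptions made for the example.

```python
def build_adjacency(nodes, edges):
    """Undirected adjacency sets; isolated nodes are kept with empty sets."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def degrees(adjacency):
    """Degree k of every node: the number of distinct neighbors."""
    return {n: len(neighbors) for n, neighbors in adjacency.items()}


def component_sizes(adjacency):
    """Sizes of all connected components, largest first.

    The first entry is the size of the giant component.
    """
    seen, sizes = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for nb in adjacency[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        sizes.append(size)
    return sorted(sizes, reverse=True)


# Hypothetical toy disease network: one hub, one small cluster, one isolate.
toy_nodes = ["d1", "d2", "d3", "d4", "d5", "d6", "d7"]
toy_edges = [("d1", "d2"), ("d1", "d3"), ("d1", "d4"), ("d5", "d6")]

adjacency = build_adjacency(toy_nodes, toy_edges)
k = degrees(adjacency)
sizes = component_sizes(adjacency)
hub = max(k, key=k.get)  # node with the largest degree
```

Here d1 plays the role of a hub, the largest component contains four of the seven nodes, and d7 has no links at all, mirroring on a small scale the mix of hubs, clusters, and isolated disorders described for the HDN.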

Although the HDN layout was generated independently of any knowledge on disorder classes, the resulting network is naturally and visibly clustered according to major disorder classes. Yet, there are visible differences between different classes of disorders. Whereas the large cancer cluster is tightly interconnected due to the many genes associated with multiple types of cancer (TP53, KRAS, ERBB2, NF1, etc.) and includes several diseases with strong predisposition to cancer, such as Fanconi anemia and ataxia telangiectasia, metabolic disorders do not appear to form a single distinct cluster but are underrepresented in the giant component and overrepresented in the small connected components (Fig. 2a). To quantify this difference, we measured the locus heterogeneity of each disorder class and the fraction of disorders that are connected to each other in the HDN (see SI Text). We find that cancer and neurological disorders show high locus heterogeneity and also represent the most connected disease classes, in contrast with metabolic, skeletal, and multiple disorders that have low genetic heterogeneity and are the least connected (SI Fig. 7).

Properties of the DGN. In the DGN, two disease genes are connected if they are associated with the same disorder, providing a complementary, gene-centered view of the diseasome. Given that the links signify related phenotypic association between two genes, they represent a measure of their phenotypic relatedness, which could be used in future studies, in conjunction with protein–protein interactions (6, 7, 19), transcription factor-promoter interactions (20), and metabolic reactions (8), to discover novel genetic interactions. In the DGN, 1,377 of 1,777 disease genes are connected to other disease genes, and 903 genes belong to a giant component (Fig. 2b). Whereas the number of genes involved in multiple diseases decreases rapidly (SI Fig. 6d; light gray nodes in Fig. 2b), several disease genes (e.g., TP53, PAX6) are involved in as many as 10 disorders, representing major hubs in the network.

Functional Clustering of HDN and DGN. To probe how the topology of the HDN and DGN deviates from random, we randomly shuffled the associations between disorders and genes, while keeping the number of links per each disorder and disease gene in the bipartite network unchanged. Interestingly, the average size of the giant component of $10^4$ randomized disease networks is 643 ± 16, significantly larger than 516 ($P < 10^{-4}$; for details of statistical analyses of the results reported hereafter, see SI Text), the actual size of the HDN (SI Fig. 6c). Similarly, the average size of the giant component from randomized gene networks is 1,087 ± 20 genes, significantly larger than 903 ($P < 10^{-4}$), the actual size of the DGN (SI Fig. 6e). These differences suggest important pathophysiological clustering of disorders and disease genes. Indeed, in the actual networks disorders (genes) are more likely linked to disorders (genes) of the same disorder class. For example, in the HDN there

[Fig. 1 graphic omitted. Panels: Human Disease Network (HDN); DISEASOME bipartite graph linking the disease phenome to the disease genome; Disease Gene Network (DGN).]

Fig. 1. Construction of the diseasome bipartite network. (Center) A small subset of OMIM-based disorder–disease gene associations (18), where circles and rectangles correspond to disorders and disease genes, respectively. A link is placed between a disorder and a disease gene if mutations in that gene lead to the specific disorder. The size of a circle is proportional to the number of genes participating in the corresponding disorder, and the color corresponds to the disorder class to which the disease belongs. (Left) The HDN projection of the diseasome bipartite graph, in which two disorders are connected if there is a gene that is implicated in both. The width of a link is proportional to the number of genes that are implicated in both diseases. For example, three genes are implicated in both breast cancer and prostate cancer, resulting in a link of weight three between them. (Right) The DGN projection where two genes are connected if they are involved in the same disorder. The width of a link is proportional to the number of diseases with which the two genes are commonly associated. A full diseasome bipartite map is provided as SI Fig. 13.


Goh et al., PNAS 2007

GENES AND DISEASES

Page 7: Introduction to Network Medicine

[Fig. 2 graphic omitted. Panels: (a) Human Disease Network; (b) Disease Gene Network. Legend, disorder classes: Bone, Cancer, Cardiovascular, Connective tissue, Dermatological, Developmental, Ear/Nose/Throat, Endocrine, Gastrointestinal, Hematological, Immunological, Metabolic, Muscular, Neurological, Nutritional, Ophthalmological, Psychiatric, Renal, Respiratory, Skeletal, multiple, Unclassified. Node size key: 1, 5, 10, 15, 21, 25, 30, 34, 41.]

Fig. 2. The HDN and the DGN. (a) In the HDN, each node corresponds to a distinct disorder, colored based on the disorder class to which it belongs, the name of the 22 disorder classes being shown on the right. A link between disorders in the same disorder class is colored with the corresponding dimmer color and links connecting different disorder classes are gray. The size of each node is proportional to the number of genes participating in the corresponding disorder (see key), and the link thickness is proportional to the number of genes shared by the disorders it connects. We indicate the name of disorders with >10 associated genes, as well as those mentioned in the text. For a complete set of names, see SI Fig. 13. (b) In the DGN, each node is a gene, with two genes being connected if they are implicated in the same disorder. The size of each node is proportional to the number of disorders in which the gene is implicated (see key). Nodes are light gray if the corresponding genes are associated with more than one disorder class. Genes associated with more than five disorders, and those mentioned in the text, are indicated with the gene symbol. Only nodes with at least one link are shown.


Page 8: Introduction to Network Medicine

Leading Edge

Review

Interactome Networks and Human Disease

Marc Vidal,1,2,* Michael E. Cusick,1,2 and Albert-Laszlo Barabasi1,3,4,*

1Center for Cancer Systems Biology (CCSB) and Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, MA 02215, USA
2Department of Genetics, Harvard Medical School, Boston, MA 02115, USA
3Center for Complex Network Research (CCNR) and Departments of Physics, Biology and Computer Science, Northeastern University, Boston, MA 02115, USA
4Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, USA
*Correspondence: [email protected] (M.V.), [email protected] (A.-L.B.)
DOI 10.1016/j.cell.2011.02.016

Complex biological systems and cellular networks may underlie most genotype to phenotype relationships. Here, we review basic concepts in network biology, discussing different types of interactome networks and the insights that can come from analyzing them. We elaborate on why interactome networks are important to consider in biology, how they can be mapped and integrated with each other, what global properties are starting to emerge from interactome network models, and how these properties may relate to human disease.

Introduction

Since the advent of molecular biology, considerable progress has been made in the quest to understand the mechanisms that underlie human disease, particularly for genetically inherited disorders. Genotype-phenotype relationships, as summarized in the Online Mendelian Inheritance in Man (OMIM) database (Amberger et al., 2009), include mutations in more than 3000 human genes known to be associated with one or more of over 2000 human disorders. This is a truly astounding number of genotype-phenotype relationships considering that a mere three decades have passed since the initial description of Restriction Fragment Length Polymorphisms (RFLPs) as molecular markers to map genetic loci of interest (Botstein et al., 1980), only two decades since the announcement of the first positional cloning experiments of disease-associated genes using RFLPs (Amberger et al., 2009), and just one decade since the release of the first reference sequences of the human genome (Lander et al., 2001; Venter et al., 2001). For complex traits, the information gathered by recent genome-wide association studies suggests high-confidence genotype-phenotype associations between close to 1000 genomic loci and one or more of over one hundred diseases, including diabetes, obesity, Crohn’s disease, and hypertension (Altshuler et al., 2008). The discovery of genomic variations involved in cancer, inherited in the germline or acquired somatically, is equally striking, with hundreds of human genes found linked to cancer (Stratton et al., 2009). In light of new powerful technological developments such as next-generation sequencing, it is easily imaginable that a catalog of nearly all human genomic variations, whether deleterious, advantageous, or neutral, will be available within our lifetime.

Despite the natural excitement emerging from such a huge body of information, daunting challenges remain. Practically, the genomic revolution has, thus far, seldom translated directly into the development of new therapeutic strategies, and the mechanisms underlying genotype-phenotype relationships remain only partially explained. Assuming that, with time, most human genotypic variations will be described together with phenotypic associations, there would still be major problems to fully understand and model human genetic variations and their impact on diseases.

To understand why, consider the “one-gene/one-enzyme/one-function” concept originally framed by Beadle and Tatum (Beadle and Tatum, 1941), which holds that simple, linear connections are expected between the genotype of an organism and its phenotype. But the reality is that most genotype-phenotype relationships arise from a much higher underlying complexity. Combinations of identical genotypes and nearly identical environments do not always give rise to identical phenotypes. The very coining of the words “genotype” and “phenotype” by Johannsen more than a century ago derived from observations that inbred isogenic lines of bean plants grown in well-controlled environments give rise to pods of different size (Johannsen, 1909). Identical twins, although strikingly similar, nevertheless often exhibit many differences (Raser and O’Shea, 2005). Likewise, genotypically indistinguishable bacterial or yeast cells grown side by side can express different subsets of transcripts and gene products at any given moment (Elowitz et al., 2002; Blake et al., 2003; Taniguchi et al., 2010). Even straightforward Mendelian traits are not immune to complex genotype-phenotype relationships. Incomplete penetrance, variable expressivity, differences in age of onset, and modifier mutations are more frequent than generally appreciated (Perlis et al., 2010).

We, along with others, argue that the way beyond these challenges is to decipher the properties of biological systems, and in particular, those of molecular networks taking place within cells. As is becoming increasingly clear, biological systems and cellular networks are governed by specific laws and principles, the understanding of which will be essential for a deeper comprehension of biology (Nurse, 2003; Vidal, 2009).

Accordingly, our goal is to review key aspects of how complex systems operate inside cells. Particularly, we will review how by interacting with each other, genes and their products form complex networks within cells. Empirically determining and modeling cellular networks for a few model organisms and for

986 Cell 144, March 18, 2011 © 2011 Elsevier Inc.

human has provided a necessary scaffold toward understanding the functional, logical and dynamical aspects of cellular systems. Importantly, we will discuss the possibility that phenotypes result from perturbations of the properties of cellular systems and networks. The link between network properties and phenotypes, including susceptibility to human disease, appears to be at least as important as that between genotypes and phenotypes (Figure 1).

Cells as Interactome Networks

Systems biology can be said to have originated more than half a century ago, when a few pioneers initially formulated a theoretical framework according to which multiscale dynamic complex systems formed by interacting macromolecules could underlie cellular behavior (Vidal, 2009). These theoretical systems biology ideas were elaborated upon at a time when there was little knowledge of the exact nature of the molecular components of biology, let alone any detailed information on functional and biophysical interactions between them. While greatly inspirational to a few specialists, systems concepts remained largely ignored by most molecular biologists, at least until empirical observations could be gathered to validate them. Meanwhile, theoretical representations of cellular organization evolved steadily, closely following the development of ever improving molecular technologies. The organizational view of the cell changed from being merely a “bag of enzymes” to a web of highly interrelated and interconnected organelles (Robinson et al., 2007). Cells can accordingly be envisioned as complex webs of macromolecular interactions, the full complement of which constitutes the “interactome” network. At the dawn of the 21st century, with most components of cellular networks having been identified, the basic ideas of systems and network biology are ready to be experimentally tested and applied to relevant biological problems.

Mapping Interactome Networks

Network science deals with complexity by “simplifying” complex systems, summarizing them merely as components (nodes) and interactions (edges) between them. In this simplified approach, the functional richness of each node is lost. Despite or even perhaps because of such simplifications, useful discoveries can be made. As regards cellular systems, the nodes are metabolites and macromolecules such as proteins, RNA molecules and gene sequences, while the edges are physical, biochemical and functional interactions that can be identified with a plethora of technologies. One challenge of network biology is to provide maps of such interactions using systematic and standardized approaches and assays that are as unbiased as possible. The resulting “interactome” networks, the networks of interactions between cellular components, can serve as scaffold information to extract global or local graph theory properties. Once shown to be statistically different from randomized networks, such properties can then be related back to a better understanding of biological processes. Potentially powerful details of each interaction in the network are left aside, including functional, dynamic and logical features, as well as biochemical and structural aspects such as protein post-translational modifications or allosteric changes. The power of the approach resides precisely in such simplification of molecular detail, which allows modeling at the scale of whole cells.

Early attempts at experimental proteome-scale interactome network mapping in the mid-1990s (Finley and Brent, 1994; Bartel et al., 1996; Fromont-Racine et al., 1997; Vidal, 1997) were inspired by several conceptual advances in biology. The biochemistry of metabolic pathways had already given rise to cellular scale representations of metabolic networks. The discovery of signaling pathways and cross-talk between them, as well as large molecular complexes such as RNA polymerases, all involving innumerable physical protein-protein interactions, suggested the existence of highly connected webs of interactions. Finally, the rapidly growing identification of many individual interactions between transcription factors and specific DNA regulatory sequences involved in the regulation of gene expression raised the question of how transcriptional regulation is globally organized within cells.

Three distinct approaches have been used since to capture interactome networks: (1) compilation or curation of already existing data available in the literature, usually obtained from one or just a few types of physical or biochemical interactions (Roberts, 2006); (2) computational predictions based on available “orthogonal” information apart from physical or biochemical interactions, such as sequence similarities, gene-order conservation, copresence and coabsence of genes in completely sequenced genomes and protein structural information (Marcotte and Date, 2001); and (3) systematic, unbiased high-throughput experimental mapping strategies applied at the scale of whole genomes or proteomes (Walhout and Vidal, 2001). These approaches, though complementary, differ greatly in the possible interpretations of the resulting maps. Literature-curated maps present the advantage of using already available information, but are limited by the inherently variable quality of the published data, the lack of systematization, and the absence of reporting of negative data (Cusick et al., 2009; Turinsky

Figure 1. Perturbations in Biological Systems and Cellular Networks May Underlie Genotype-Phenotype Relationships
By interacting with each other, genes and their products form complex cellular networks. The link between perturbations in network and systems properties and phenotypes, such as Mendelian disorders, complex traits, and cancer, might be as important as that between genotypes and phenotypes.


et al., 2010). Computational prediction maps are fast and efficient to implement, and usually include satisfyingly large numbers of nodes and edges, but are necessarily imperfect because they use indirect information (Plewczynski and Ginalski, 2009). While high-throughput maps attempt to describe unbiased, systematic, and well-controlled data, they were initially more difficult to establish, although recent technological advances suggest that near completion can be reached within a few years for highly reliable, comprehensive protein-protein interaction and gene regulatory network maps for human (Venkatesan et al., 2009).

The mapping and analysis of interactome networks for model organisms was instrumental in getting to this point. Such efforts provided, and will continue to provide, both necessary pioneering technologies and crucial conceptual insights. As with other aspects of biology, advancements in mapping of interactome networks would have been minimal without a focus on model organisms (Davis, 2004). The field of interactome mapping has been helped by developments in several model organisms, primarily the yeast, Saccharomyces cerevisiae, the fly, Drosophila melanogaster, and the worm, Caenorhabditis elegans (Figure 2). For instance, genome-wide resources such as collections of all, or nearly all, open reading frames (ORFeomes) were first generated for these model organisms, both because their genomes are the best annotated and because there are fewer complications, such as the high number of splice variants in human and other mammals. ORFeome resources allow efficient transfer of large numbers of ORFs into vectors suitable for diverse interactome mapping technologies (Hartley et al., 2000; Walhout et al., 2000b). Moreover, gene ablation technologies, knockouts (for yeast) and knockdowns by RNAi (for worms and flies) and transposon insertions (for plants), were discovered in and are being applied genome-wide for these model organisms (Mohr et al., 2010).

Metabolic Networks

Metabolic network maps attempt to comprehensively describe all possible biochemical reactions for a particular cell or organism (Schuster et al., 2000; Edwards et al., 2001). In many representations of metabolic networks, nodes are biochemical metabolites and edges are either the reactions that convert one metabolite into another or the enzymes that catalyze these reactions (Jeong et al., 2000; Schuster et al., 2000) (Figure 2). Edges can be directed or undirected, depending on whether a given reaction is reversible or not. In specific cases of metabolic network modeling, the converse situation can be used, with nodes representing enzymes and edges pointing to adjacent pairs of enzymes for which the product of one is the substrate of the other (Lee et al., 2008).

Although large metabolic pathway charts have existed for decades (Kanehisa et al., 2008), nearly complete metabolic network maps required the completion of full genome sequencing together with accurate gene annotation tools (Oberhardt et al., 2009). Network construction is manual with computational assistance, involving: (1) the meticulous curation of large numbers of publications, each describing experimental results regarding one or several metabolic reactions characterized from purified or reconstituted enzymes, and (2) when necessary, the compilation of predicted reactions from studies of orthologous enzymes experimentally characterized in other species. Assembly of the union of all experimentally demonstrated and predicted reactions gives rise to proteome-scale network maps (Mo and Palsson, 2009). Such maps have been compiled for numerous species, predominantly prokaryotes

Figure 2. Networks in Cellular Systems
To date, cellular networks are most available for the “super-model” organisms (Davis, 2004) yeast, worm, fly, and plant. High-throughput interactome mapping relies upon genome-scale resources such as ORFeome resources. Several types of interactome networks discussed are depicted. In a protein interaction network, nodes represent proteins and edges represent physical interactions. In a transcriptional regulatory network, nodes represent transcription factors (circular nodes) or putative DNA regulatory elements (diamond nodes), and edges represent physical binding between the two. In a disease network, nodes represent diseases, and edges represent gene mutations that are associated with the linked diseases. In a virus-host network, nodes represent viral proteins (square nodes) or host proteins (round nodes), and edges represent physical interactions between the two. In a metabolic network, nodes represent enzymes, and edges represent metabolites that are products or substrates of the enzymes. The network depictions seem dense, but they represent only small portions of available interactome network maps, which themselves constitute only a few percent of the complete interactomes within cells.


Cell 2011

DISEASES AS NETWORK PERTURBATIONS

Page 9: Introduction to Network Medicine

Most cellular components exert their functions through interactions with other cellular components, which can be located either in the same cell or across cells, and even across organs. In humans, the potential complexity of the resulting network (the human interactome) is daunting: with ~25,000 protein-coding genes, ~1,000 metabolites and an undefined number of distinct proteins1 and functional RNA molecules, the number of cellular components that serve as the nodes of the interactome easily exceeds 100,000. The number of functionally relevant interactions between the components of this network, representing the links of the interactome, is expected to be much larger2.

This inter- and intracellular interconnectivity implies that the impact of a specific genetic abnormality is not restricted to the activity of the gene product that carries it, but can spread along the links of the network and alter the activity of gene products that otherwise carry no defects. Therefore, an understanding of a gene’s network context is essential in determining the phenotypic impact of defects that affect it3,4. Following on from this principle, a key hypothesis underlying this Review is that a disease phenotype is rarely a consequence of an abnormality in a single effector gene product, but reflects various pathobiological processes that interact in a complex network. A corollary of this widely held hypothesis is that the interdependencies among a cell’s molecular components lead to deep functional, molecular and causal relationships among apparently distinct phenotypes.

Network-based approaches to human disease have multiple potential biological and clinical applications. A better understanding of the effects of cellular interconnectedness on disease progression may lead to the identification of disease genes and disease pathways, which, in turn, may offer better targets for drug development. These advances may also lead to better and more accurate biomarkers to monitor the functional integrity of networks that are perturbed by diseases as well as to better disease classification. Here we present an overview of the organizing principles that govern cellular networks and the implications of these principles for understanding disease. These principles and the tools and methodologies that are derived from them are facilitating the emergence of a body of knowledge that is increasingly referred to as network medicine5–7.

The human interactome

Although much of our understanding of cellular networks is derived from model organisms, the past decade has seen an exceptional growth in human-specific molecular interaction data8. Most attention has been directed towards molecular networks, including protein interaction networks, whose nodes are proteins that are linked to each other by physical (binding) interactions9,10; metabolic networks, whose nodes are metabolites that are linked if they participate in the same biochemical reactions11–13; regulatory networks, whose directed links represent either regulatory relationships between a transcription factor and a gene14, or post-translational

*Center for Complex Networks Research and Department of Physics, Northeastern University, 110 Forsyth Street, 111 Dana Research Center, Boston, Massachusetts 02115, USA.
‡Center for Cancer Systems Biology, Dana-Farber Cancer Institute, 44 Binney Street, Boston, Massachusetts 02115, USA.
§Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, 75 Francis Street, Boston, Massachusetts 02115, USA.
||Department of Cellular and Molecular Pharmacology, University of California, 1700 4th Street, Byers Hall 309, Box 2530, San Francisco, California 94158, USA.
Correspondence to A.-L.B. e-mail: [email protected]
doi:10.1038/nrg2918

Network medicine: a network-based approach to human disease
Albert-László Barabási*‡§, Natali Gulbahce*‡|| and Joseph Loscalzo§

Abstract | Given the functional interdependencies between the molecular components in a human cell, a disease is rarely a consequence of an abnormality in a single gene, but reflects the perturbations of the complex intracellular and intercellular network that links tissue and organ systems. The emerging tools of network medicine offer a platform to explore systematically not only the molecular complexity of a particular disease, leading to the identification of disease modules and pathways, but also the molecular relationships among apparently distinct (patho)phenotypes. Advances in this direction are essential for identifying new disease genes, for uncovering the biological significance of disease-associated mutations identified by genome-wide association studies and full-genome sequencing, and for identifying drug targets and biomarkers for complex diseases.

REVIEWS

56 | JANUARY 2011 | VOLUME 12 www.nature.com/reviews/genetics

© 2011 Macmillan Publishers Limited. All rights reserved

[Figure 1 panel labels (a, b): Human genes; Essential genes; Disease genes; Essential disease genes; Non-essential disease genes; Essential non-disease genes; Centre; Periphery; Essential proteins; Disease proteins. Source: Nature Reviews | Genetics]

Degree
The degree of a node is the number of links that connect to it. The degree of a protein could represent the number of proteins with which it interacts, whereas the degree of a disease may represent the number of other diseases that are associated with the same gene or that have a common phenotype.
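The degree definition above can be illustrated with a minimal sketch. It assumes the network is supplied as an undirected list of node pairs; the function name `node_degrees` and the toy node labels are hypothetical and do not come from the source.

```python
from collections import defaultdict


def node_degrees(edges):
    """Return {node: degree} for an undirected edge list.

    Self-loops are ignored and duplicate links are counted once,
    so the degree equals the number of distinct neighbours.
    """
    neighbours = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # a self-interaction adds no neighbour
        neighbours[u].add(v)
        neighbours[v].add(u)
    return {node: len(nbrs) for node, nbrs in neighbours.items()}


# Hypothetical toy interaction list (labels are placeholders, not real proteins).
toy_links = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("C", "B")]
toy_degrees = node_degrees(toy_links)
print(toy_degrees)
```

In this toy network, node A has the highest degree, which is the sense in which the text speaks of hubs.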

Module (or community)
A dense subgraph on the network that often represents a set of nodes that have a joint role. In biology, a module could correspond to a group of molecules that interact with each other to achieve some common function.

of a particular functional module51, intimating that a functional module is also a disease module. However, several unique characteristics of disease modules are important to bear in mind. First, a disease module may not be identical to, but is likely to overlap with, the topological and/or functional modules. Second, a disease module is defined in relation to a particular disease and, accordingly, each disease has its own unique module. Last, a gene, protein or metabolite can be implicated in several disease modules, which means that different disease modules can overlap. These characteristics aid the disease module identification process, an important step of network medicine (FIG. 3).

The emergence of a disease is therefore viewed as a combinatorial problem in which many different defects and perturbations result in a similar disease phenotype, provided that they alter the activity of the disease module. Such combinatorial disease mechanisms are well documented in cancer52, but the utility of the disease module hypothesis extends beyond polygenic diseases and is important even in some monogenic diseases. For example, sickle cell disease, a classic Mendelian disorder, is caused by a single point mutation at position 6 of the β-chain of haemoglobin. Still, this simple biochemical phenotype and its corresponding monogenotype do not yield a single pathophenotype: individuals with sickle cell disease can present with painful crises, osteonecrosis, acute chest syndrome, stroke, profound anaemia or mild asymptomatic anaemia. Thus, the underlying disease module is likely to include all disease-modifying genes (for example, haemoglobin F) that mediate various epigenetic, transcriptional and post-translational phenomena. An important step of network-based approaches to disease is, therefore, to identify the disease module for the pathophenotype of interest, which, in turn, can guide further experimental work towards uncovering the disease mechanism, predicting disease genes and influencing drug development.

Predicting disease genes

Disease-associated genes have generally been identified using linkage mapping or, more recently, genome-wide association (GWA) studies53. Both methodologies can suggest large numbers of disease-gene candidates, but identifying the particular gene and the causal mutation remains difficult. Recently, a series of increasingly sophisticated network-based tools have been developed to predict potential disease genes; these tools can be loosely grouped into three categories, as discussed below (FIG. 4).

Linkage methods. These methods assume that the direct interaction partners of a disease protein are likely to be associated with the same disease phenotype45,54–56. Indeed, for one disease locus, the set of genes within the locus whose products interacted with a known disease protein was shown to be tenfold enriched in true disease-causing genes45. By also considering cellular localization, this approach led to a 1,000-fold enrichment over a random selection. On this basis, the authors predicted and confirmed the involvement of Janus kinase 3 (JAK3) in severe combined immunodeficiency syndrome owing to its interaction with known disease-associated proteins.
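The core filter behind linkage methods can be sketched as follows. This is only an illustration of the stated assumption (keep the genes in a locus whose products interact directly with a known disease protein), not the procedure of the cited studies; the function `linkage_candidates` and all gene labels are hypothetical.

```python
def linkage_candidates(locus_genes, known_disease_genes, interactions):
    """Return locus genes whose products interact directly with the
    product of at least one known disease gene.

    interactions is an undirected list of (gene, gene) pairs.
    """
    known = set(known_disease_genes)
    partners = set()
    for u, v in interactions:
        # Collect every direct interaction partner of a known disease gene.
        if u in known:
            partners.add(v)
        if v in known:
            partners.add(u)
    # Keep partners located in the locus that are not already known.
    return sorted((set(locus_genes) & partners) - known)


# Hypothetical toy data (placeholder labels, not real genes).
toy_locus = ["g1", "g2", "g3", "g4"]
toy_known = ["d1", "d2"]
toy_interactions = [("g1", "d1"), ("g2", "x9"), ("d2", "g4"), ("g3", "g2")]
toy_candidates = linkage_candidates(toy_locus, toy_known, toy_interactions)
print(toy_candidates)
```

A fuller version would add the cellular-localization filter mentioned in the text, which is left out here because the source does not specify how it was applied.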

Disease module-based methods. A second set of methods assumes that all cellular components that belong to the same topological, functional or disease module have a high likelihood of being involved in the same disease57,58. These methods start with identifying the disease modules and inspecting their members as potential disease genes. Disease modules can be identified on the basis of currently available data using bioinformatics approaches (FIG. 3). Briefly, this strategy involves constructing the interactome in the tissue and cell line of interest and identifying a subnetwork, or disease module, that contains most of the disease-associated genes. Disease modules are then validated by, for example, showing that the genes in a module have related functions or have correlated expression patterns.
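One deliberately simple way to approximate the "subnetwork that contains most of the disease-associated genes" is to take the largest connected component of the interactome restricted to known disease genes. This is an assumption made for illustration, not the method of the cited studies, which the source does not detail; `largest_disease_component` and the toy labels are hypothetical.

```python
from collections import defaultdict, deque


def largest_disease_component(interactions, disease_genes):
    """Return the largest connected set of disease genes in the
    subgraph induced by disease_genes (ties resolved by sorted order)."""
    disease_genes = set(disease_genes)
    adjacency = defaultdict(set)
    for u, v in interactions:
        # Keep only links whose two endpoints are both disease genes.
        if u != v and u in disease_genes and v in disease_genes:
            adjacency[u].add(v)
            adjacency[v].add(u)

    best, visited = set(), set()
    for start in sorted(disease_genes):
        if start in visited:
            continue
        component, queue = {start}, deque([start])
        while queue:  # breadth-first search from the start gene
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        visited |= component
        if len(component) > len(best):
            best = component
    return best


# Hypothetical toy interactome; "x1" is not a disease gene.
toy_interactome = [("g1", "g2"), ("g2", "g3"), ("g3", "x1"),
                   ("x1", "g4"), ("g5", "g6")]
toy_disease_genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
toy_module = largest_disease_component(toy_interactome, toy_disease_genes)
print(sorted(toy_module))
```

Here g4 is excluded because it reaches the other disease genes only through a non-disease intermediate; methods that admit such connector nodes would return a larger module.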

Variants of this methodology have been applied to a wide range of diseases and pathophenotypes, including several different types of cancer59–66, neurological diseases67–69, cardiovascular diseases68,70, systemic inflammation71,72, obesity73–75, asthma76, type 2 diabetes77 and chronic fatigue syndrome78. For example, Taylor et al.66 identified disease-associated protein interaction modules for adenocarcinoma of the breast, providing useful indicators for predicting breast cancer outcome. Similarly, Chen et al.73 identified subnetworks in liver and adipose tissues that contain genes for which variants associated with obesity and diabetes have been identified. The results confirmed a previously proposed connection between obesity and a macrophage-enriched metabolic subnetwork, validating three previously unknown genes, lipoprotein lipase (Lpl), β-lactamase (Lactb) and protein phosphatase, Mg2+/Mn2+ dependent, 1L (Ppm1l), as obesity genes in transgenic mice. The disease module-based approach has also been useful in exploring pathogen-induced phenotypes79–81 (N.G. et al., unpublished data).

Figure 1 | Disease and essential genes in the interactome. a | Of the approximately 25,000 human genes, 2,418 are associated with specific diseases. The figure shows the overlap between the 1,777 disease-associated genes that were known42 in 2007 and the 1,665 genes that are in utero essential, that is, their absence is associated with embryonic lethality. b | Schematic diagram of the differences between essential and non-essential disease genes. Non-essential disease genes (shown as blue nodes) are found to segregate at the network periphery, whereas in utero essential genes (shown as red nodes) tend to be at the functional centre (encoding hubs and expressed in many tissues) of the interactome. Part a is reproduced, with permission, from REF. 42 (2007) National Academy of Sciences.


[Figure 2 panel labels: a | Topological module (topologically close genes or products); b | Functional module (functionally similar genes or products); c | Disease module (disease genes or products). Legend: bidirectional interactions; directed interactions. Source: Nature Reviews | Genetics]

Edgetic
Edgetic perturbations denote mutations that do not result in the complete loss of a gene product, but affect one or several interactions (and thus functions) of a protein. From a network perspective, an edgetic perturbation removes one or several links, but leaves the other links and the node unaffected.
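The contrast between an edgetic perturbation and the complete loss of a gene product can be sketched on an undirected edge list. The two helper functions and the toy labels below are hypothetical and serve only to illustrate the definition above.

```python
def remove_node(edges, node):
    """Complete loss of a gene product: the node and all of its links vanish."""
    return [(u, v) for u, v in edges if node not in (u, v)]


def remove_links(edges, lost_links):
    """Edgetic perturbation: only the listed links vanish; every node and
    every other link is left untouched. Link direction is ignored."""
    lost = {frozenset(link) for link in lost_links}
    return [(u, v) for u, v in edges if frozenset((u, v)) not in lost]


# Hypothetical toy network around protein P (placeholder labels).
toy_network = [("P", "A"), ("P", "B"), ("P", "C"), ("A", "B")]

after_null = remove_node(toy_network, "P")
after_edgetic = remove_links(toy_network, [("B", "P")])
print(after_null)
print(after_edgetic)
```

After the null-like perturbation P has no interactions left, whereas after the edgetic one it keeps its links to A and C, so two alleles of the same gene can disturb different parts of the network.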

Shared gene hypothesis and the human disease network. The linkage of a gene to different disease pathophenotypes often indicates that these diseases have a common genetic origin. Motivated by this hypothesis, Goh et al.42 used the gene–disease associations that are collected in the OMIM database to build a network of diseases that are linked if they share one or more genes. In the obtained human disease network (HDN), 867 of 1,284 diseases with an associated gene are connected to at least one other disease, and 516 of them belong to a single disease cluster (FIG. 5). The clustering of nodes of similar colour in FIG. 5, denoting the disease class, reflects the fact that similar pathophenotypes have a higher likelihood of sharing genes than do pathophenotypes that belong to different disease classes. For example, cancers form a tightly interconnected and easily detectable cluster, which is held together by a small group of genes that are associated with multiple cancers.
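The construction rule for the human disease network (two diseases are linked if they share at least one gene) can be sketched as a projection of gene–disease associations. The input format, the function `shared_gene_network` and the placeholder labels are assumptions for illustration; the source does not describe how the OMIM data were parsed.

```python
from collections import defaultdict
from itertools import combinations


def shared_gene_network(associations):
    """Project (disease, gene) pairs onto a disease-disease network.

    Returns {(disease_a, disease_b): set of shared genes}, with each
    disease pair stored once in sorted order.
    """
    diseases_by_gene = defaultdict(set)
    for disease, gene in associations:
        diseases_by_gene[gene].add(disease)

    links = defaultdict(set)
    for gene, diseases in diseases_by_gene.items():
        # Every pair of diseases associated with this gene gets a link.
        for pair in combinations(sorted(diseases), 2):
            links[pair].add(gene)
    return dict(links)


# Hypothetical toy associations (placeholder labels, not OMIM records).
toy_associations = [
    ("disease_1", "gene_a"), ("disease_2", "gene_a"),
    ("disease_2", "gene_b"), ("disease_3", "gene_b"),
    ("disease_4", "gene_c"),
]
toy_hdn = shared_gene_network(toy_associations)
print(toy_hdn)
```

In the toy result, disease_4 stays unconnected because it shares no gene with any other disease, which mirrors the diseases in the HDN that have an associated gene but no link.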

To determine whether the sharing of genes has consequences for disease occurrence in populations, the comorbidity between linked disease pairs has been examined90 (FIG. 5). This analysis indicates that a patient is twice as likely to develop a particular disease if that disease shares a gene with the patient’s primary disease. But many disease pairs that share genes do not show significant comorbidity. One explanation is that different mutations in the same gene can have different effects on the gene product, and therefore different pathological consequences91 that are organ and context dependent. Such ‘edgetic’ alleles affect a specific subset of links in the interactome92. Consistent with this view, disease pairs that are associated with mutations that affect the same functional domain of a protein show higher comorbidity than do disease pairs with mutations that occur in different functional domains90 (FIG. 5).

Shared metabolic pathway hypothesis and the metabolic disease network. An enzymatic defect that affects the flux of one reaction can potentially affect the fluxes of all downstream reactions in the same pathway, leading to disease phenotypes that are normally associated with these downstream reactions. Thus, for metabolic diseases, links that are induced by shared metabolic pathways are expected to be more relevant than are links based on shared genes. In support of this hypothesis, Lee et al.93 constructed a metabolic disease network (MDN) in which two disorders are connected if the enzymes associated with them catalyse adjacent reactions (FIG. 5b). The visually apparent clustering of the MDN mirrors distinct metabolic pathways. For example, purine metabolism consists of 62 reactions associated with 33 diseases, including nucleoside phosphorylase deficiency and congenital dyserythropoietic anaemia, which form a visually distinct cluster. Comorbidity analysis confirms the functional relevance of metabolic coupling: disease pairs that are linked in the MDN have a 1.8-fold increased comorbidity compared to disease pairs that are not linked metabolically93. Comorbidity is even more pronounced if the fluxes of the reactions that are catalysed by the respective disease genes are themselves coupled; that is, changes in one flux induce significant changes in the other flux, even if the corresponding reactions are not adjacent.
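The linking rule of the metabolic disease network can be sketched under one explicit assumption: two reactions count as adjacent when a product of one is a substrate of the other. The source does not spell out the adjacency criterion used by Lee et al., so this choice, the function `metabolic_disease_links`, the data layout and all labels are hypothetical.

```python
from itertools import combinations


def metabolic_disease_links(reactions, disease_of_enzyme):
    """Link two disorders if their enzymes catalyse adjacent reactions.

    reactions: {enzyme: (substrates, products)}
    disease_of_enzyme: {enzyme: disorder}
    Adjacency (assumed): a product of one reaction is a substrate of the other.
    """
    links = set()
    for e1, e2 in combinations(sorted(reactions), 2):
        substrates1, products1 = map(set, reactions[e1])
        substrates2, products2 = map(set, reactions[e2])
        adjacent = bool(products1 & substrates2) or bool(products2 & substrates1)
        d1, d2 = disease_of_enzyme.get(e1), disease_of_enzyme.get(e2)
        if adjacent and d1 and d2 and d1 != d2:
            links.add(tuple(sorted((d1, d2))))
    return links


# Hypothetical toy pathway: enz_1 feeds enz_2; enz_3 is in an unrelated pathway.
toy_reactions = {
    "enz_1": (["m1"], ["m2"]),
    "enz_2": (["m2"], ["m3"]),
    "enz_3": (["m5"], ["m6"]),
}
toy_disorders = {"enz_1": "disorder_A", "enz_2": "disorder_B", "enz_3": "disorder_C"}
toy_mdn = metabolic_disease_links(toy_reactions, toy_disorders)
print(toy_mdn)
```

The flux-coupling refinement described at the end of the paragraph would need a flux model of the whole network and is not attempted in this sketch.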

Figure 2 | Disease modules. Schematic diagram of the three modularity concepts that are discussed in this Review. a | Topological modules correspond to locally dense neighbourhoods of the interactome, such that the nodes of the module show a higher tendency to interact with each other than with nodes outside the module. As such, topological modules represent a pure network property. b | Functional modules correspond to network neighbourhoods in which there is a statistically significant segregation of nodes of related function. Thus, a functional module requires us to define some nodal characteristics (shown as grey nodes) and relies on the hypothesis that nodes that are involved in closely related cellular functions tend to interact with each other and are therefore located in the same network neighbourhood. c | A disease module represents a group of nodes whose perturbation (mutations, deletions, copy number variations or expression changes) can be linked to a particular disease phenotype, shown as red nodes. The tacit assumption in network medicine is that the topological, functional and disease modules overlap, so that functional modules correspond to topological modules and a disease can be viewed as the breakdown of a functional module.
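The defining property of a topological module in the caption above (members interact more with each other than with outside nodes) can be checked with a simple count of internal and boundary links. This is a minimal sketch; `module_link_counts` and the toy network are hypothetical, and real module-detection methods use more refined statistics.

```python
def module_link_counts(edges, module):
    """Count links inside a candidate module and links crossing its boundary."""
    module = set(module)
    internal = external = 0
    for u, v in edges:
        members = (u in module) + (v in module)
        if members == 2:
            internal += 1   # both endpoints inside the module
        elif members == 1:
            external += 1   # link crosses the module boundary
    return internal, external


# Hypothetical toy network: a triangle a-b-c with a tail c-d-e.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
toy_counts = module_link_counts(toy_edges, {"a", "b", "c"})
print(toy_counts)
```

A candidate set with many more internal than boundary links, as in this toy case, is the kind of locally dense neighbourhood the caption describes.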


Most cellular components exert their functions through interactions with other cellular components, which can be located either in the same cell or across cells, and even across organs. In humans, the potential complexity of the resulting network — the human interactome — is daunting: with ~25,000 protein-coding genes, ~1,000 metabolites and an undefined number of distinct proteins1 and functional RNA molecules, the number of cellular components that serve as the nodes of the inter-actome easily exceeds 100,000. The number of function-ally relevant interactions between the components of this network, representing the links of the interactome, is expected to be much larger2.

This inter- and intracellular interconnectivity implies that the impact of a specific genetic abnormality is not restricted to the activity of the gene product that carries it, but can spread along the links of the network and alter the activity of gene products that otherwise carry no defects. Therefore, an understanding of a gene’s net-work context is essential in determining the phenotypic impact of defects that affect it3,4. Following on from this principle, a key hypothesis underlying this Review is that a disease phenotype is rarely a consequence of an abnormality in a single effector gene product, but reflects various pathobiological processes that inter-act in a complex network. A corollary of this widely held hypothesis is that the interdependencies among a cell’s molecular components lead to deep functional, molecular and causal relationships among apparently distinct phenotypes.

Network-based approaches to human disease have multiple potential biological and clinical applications. A better understanding of the effects of cellular intercon-nectedness on disease progression may lead to the iden-tification of disease genes and disease pathways, which, in turn, may offer better targets for drug development. These advances may also lead to better and more accurate biomarkers to monitor the functional integrity of net-works that are perturbed by diseases as well as to better disease classification. Here we present an overview of the organizing principles that govern cellular networks and the implications of these principles for understand-ing disease. These principles and the tools and method-ologies that are derived from them are facilitating the emergence of a body of knowledge that is increasingly referred to as network medicine5–7.

The human interactomeAlthough much of our understanding of cellular net-works is derived from model organisms, the past dec-ade has seen an exceptional growth in human-specific molecular interaction data8. Most attention has been directed towards molecular networks, including protein interaction networks, whose nodes are proteins that are linked to each other by physical (binding) interactions9,10; metabolic networks, whose nodes are metabolites that are linked if they participate in the same biochemi-cal reactions11–13; regulatory networks, whose directed links represent either regulatory relationships between a transcription factor and a gene14, or post-translational

*Center for Complex Networks Research and Department of Physics, Northeastern University, 110 Forsyth Street, 111 Dana Research Center, Boston, Massachusetts 02115, USA.‡Center for Cancer Systems Biology, Dana-Farber Cancer Institute, 44 Binney Street, Boston, Massachusetts 02115, USA.§Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, 75 Francis Street, Boston, Massachusetts 02115, USA.||Department of Cellular and Molecular Pharmacology, University of California, 1700 4th Street, Byers Hall 309, Box 2530, San Francisco, California 94158, USA.Correspondence to A.-L.B. e-mail: [email protected]:10.1038/nrg2918

Network medicine: a network-based approach to human diseaseAlbert-László Barabási*‡§, Natali Gulbahce*‡|| and Joseph Loscalzo§

Abstract | Given the functional interdependencies between the molecular components in a human cell, a disease is rarely a consequence of an abnormality in a single gene, but reflects the perturbations of the complex intracellular and intercellular network that links tissue and organ systems. The emerging tools of network medicine offer a platform to explore systematically not only the molecular complexity of a particular disease, leading to the identification of disease modules and pathways, but also the molecular relationships among apparently distinct (patho)phenotypes. Advances in this direction are essential for identifying new disease genes, for uncovering the biological significance of disease-associated mutations identified by genome-wide association studies and full-genome sequencing, and for identifying drug targets and biomarkers for complex diseases.

REVIEWS

56 | JANUARY 2011 | VOLUME 12 www.nature.com/reviews/genetics

© 2011 Macmillan Publishers Limited. All rights reserved

Most cellular components exert their functions through interactions with other cellular components, which can be located either in the same cell or across cells, and even across organs. In humans, the potential complexity of the resulting network — the human interactome — is daunting: with ~25,000 protein-coding genes, ~1,000 metabolites and an undefined number of distinct proteins1 and functional RNA molecules, the number of cellular components that serve as the nodes of the inter-actome easily exceeds 100,000. The number of function-ally relevant interactions between the components of this network, representing the links of the interactome, is expected to be much larger2.

This inter- and intracellular interconnectivity implies that the impact of a specific genetic abnormality is not restricted to the activity of the gene product that carries it, but can spread along the links of the network and alter the activity of gene products that otherwise carry no defects. Therefore, an understanding of a gene’s net-work context is essential in determining the phenotypic impact of defects that affect it3,4. Following on from this principle, a key hypothesis underlying this Review is that a disease phenotype is rarely a consequence of an abnormality in a single effector gene product, but reflects various pathobiological processes that inter-act in a complex network. A corollary of this widely held hypothesis is that the interdependencies among a cell’s molecular components lead to deep functional, molecular and causal relationships among apparently distinct phenotypes.

Network-based approaches to human disease have multiple potential biological and clinical applications. A better understanding of the effects of cellular interconnectedness on disease progression may lead to the identification of disease genes and disease pathways, which, in turn, may offer better targets for drug development. These advances may also lead to better and more accurate biomarkers to monitor the functional integrity of networks that are perturbed by diseases as well as to better disease classification. Here we present an overview of the organizing principles that govern cellular networks and the implications of these principles for understanding disease. These principles and the tools and methodologies that are derived from them are facilitating the emergence of a body of knowledge that is increasingly referred to as network medicine5–7.

The human interactome

Although much of our understanding of cellular networks is derived from model organisms, the past decade has seen an exceptional growth in human-specific molecular interaction data8. Most attention has been directed towards molecular networks, including protein interaction networks, whose nodes are proteins that are linked to each other by physical (binding) interactions9,10; metabolic networks, whose nodes are metabolites that are linked if they participate in the same biochemical reactions11–13; regulatory networks, whose directed links represent either regulatory relationships between a transcription factor and a gene14, or post-translational

*Center for Complex Networks Research and Department of Physics, Northeastern University, 110 Forsyth Street, 111 Dana Research Center, Boston, Massachusetts 02115, USA.
‡Center for Cancer Systems Biology, Dana-Farber Cancer Institute, 44 Binney Street, Boston, Massachusetts 02115, USA.
§Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, 75 Francis Street, Boston, Massachusetts 02115, USA.
||Department of Cellular and Molecular Pharmacology, University of California, 1700 4th Street, Byers Hall 309, Box 2530, San Francisco, California 94158, USA.
Correspondence to A.-L.B. e-mail: [email protected]
doi:10.1038/nrg2918

Network medicine: a network-based approach to human disease
Albert-László Barabási*‡§, Natali Gulbahce*‡|| and Joseph Loscalzo§


NETWORK MEDICINE

Page 10: Introduction to Network Medicine

RESEARCH ARTICLE SUMMARY

DISEASE NETWORKS

Uncovering disease-disease relationships through the incomplete interactome
Jörg Menche, Amitabh Sharma, Maksim Kitsak, Susan Dina Ghiassian, Marc Vidal, Joseph Loscalzo, Albert-László Barabási*

INTRODUCTION: A disease is rarely a straightforward consequence of an abnormality in a single gene, but rather reflects the interplay of multiple molecular processes. The relationships among these processes are encoded in the interactome, a network that integrates all physical interactions within a cell, from protein-protein to regulatory protein–DNA and metabolic interactions. The documented propensity of disease-associated proteins to interact with each other suggests that they tend to cluster in the same neighborhood of the interactome, forming a disease module, a connected subgraph that contains all molecular determinants of a disease. The accurate identification of the corresponding disease module represents the first step toward a systematic understanding of the molecular mechanisms underlying a complex disease. Here, we present a network-based framework to identify the location of disease modules within the interactome and use the overlap between the modules to predict disease-disease relationships.

RATIONALE: Despite impressive advances in high-throughput interactome mapping and disease gene identification, both the interactome and our knowledge of disease-associated genes remain incomplete. This incompleteness prompts us to ask to what extent the current data are sufficient to map out the disease modules, the first step toward an integrated approach toward human disease. To make progress, we must formulate mathematically the impact of network incompleteness on the identifiability of disease modules, quantifying the predictive power and the limitations of the current interactome.

RESULTS: Using the tools of network science, we show that we can only uncover disease modules for diseases whose number of associated genes exceeds a critical threshold determined by the network incompleteness. We find that disease proteins associated with 226 diseases are clustered in the same network neighborhood, displaying a statistically significant tendency to form identifiable disease modules. The higher the degree of agglomeration of the disease proteins within the interactome, the higher the biological and functional similarity of the corresponding genes. These findings indicate that many local neighborhoods of the interactome represent the observable part of the true, larger and denser disease modules.

If two disease modules overlap, local perturbations causing one disease can disrupt pathways of the other disease module as well, resulting in shared clinical and pathobiological characteristics. To test this hypothesis, we measure the network-based separation of each disease pair, observing a direct relation between the pathobiological similarity of diseases and their relative distance in the interactome. We find that disease pairs with overlapping disease modules display significant molecular similarity, elevated coexpression of their associated genes, and similar symptoms and high comorbidity. At the same time, non-overlapping disease pairs lack any detectable pathobiological relationships. The proposed network-based distance allows us to predict the pathobiological relationship even for diseases that do not share genes.

CONCLUSION: Despite its incompleteness, the interactome has reached sufficient coverage to allow the systematic investigation of disease mechanisms and to help uncover the molecular origins of the pathobiological relationships between diseases. The introduced network-based framework can be extended to address numerous questions at the forefront of network medicine, from interpreting genome-wide association study data to drug target identification and repurposing.

RESEARCH

SCIENCE sciencemag.org 20 FEBRUARY 2015 • VOL 347 ISSUE 6224 841

ON OUR WEB SITE

Read the full article at http://dx.doi.org/10.1126/science.1257601

Diseases within the interactome. The interactome collects all physical interactions between a cell’s molecular components. Proteins associated with the same disease form connected subgraphs, called disease modules, shown for multiple sclerosis (MS), peroxisomal disorders (PD), and rheumatoid arthritis (RA). Disease pairs with overlapping modules (MS and RA) have some phenotypic similarities and high comorbidity. Non-overlapping diseases, like MS and PD, lack detectable clinical relationships.

The list of author affiliations is available in the full article online.
*Corresponding author. E-mail: [email protected]
Cite this article as J. Menche et al., Science 347, 1257601 (2015). DOI: 10.1126/science.1257601


Menche et al., Science 2015

DISEASES AS NETWORK NEIGHBORHOODS

Page 11: Introduction to Network Medicine

The Interactome as a Map

Page 12: Introduction to Network Medicine

Diseases As Local Neighborhoods

Page 13: Introduction to Network Medicine

Diseases As Local Neighborhoods

[Figure: disease neighborhoods highlighted on the interactome. Labeled diseases: asthma, Parkinson’s, leukemia, MS, hypertension, rheumatoid arthritis, Crohn’s disease, type 2 diabetes, glioblastoma, ulcerative colitis, heart failure. Labeled genes: AIF1, ZBTB12, NFKBIZ, MERTK, HHEX, CFB, CD58, MICB.]

Network clustering means explainable biology

Page 14: Introduction to Network Medicine

Interactome and disease genes

[Figure: observable module for multiple sclerosis. Node legend: multiple sclerosis genes from GWAS, from OMIM, or from both (GWAS & OMIM); other disease genes; genes with multiple disease associations. Link legend (molecular interactions): signalling, complexes, kinase-substrate, metabolic, literature, regulatory, yeast two-hybrid; some interactions have multiple lines of evidence (for example signaling, complexes, literature, regulatory). Example annotations of a gene with multiple disease associations: OMIM (immunologic deficiency syndromes, hematologic diseases, blood protein disorders) and GWAS (connective tissue diseases, autoimmune diseases, joint diseases, musculoskeletal diseases, rheumatoid arthritis). Labeled genes: AKT1, HLA-B, HLA-C, STAT3, TAP2, NFKBIZ, IL2RA, TNFRSF1A, EHMT2, PTK2, IL7R, MAPK1.]

• The interactome contains 141,296 physical interactions between 13,460 proteins
• We study 299 diseases with at least 20 gene associations

Menche et al., Science 2015

Page 15: Introduction to Network Medicine

Measures of network localization

[Figure: multiple sclerosis proteins in the interactome, with the largest connected component (size S = 11) and example shortest distances (d = 2, d = 3) marked; node labels are gene symbols. Two histograms compare the data with random expectation: frequency vs. size of largest component (observed value 11), and frequency vs. shortest distance d.]

• We use two measures to quantify the interactome-based localization of a disease
• 226 out of 299 diseases are significantly localized according to both measures
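The first of the two localization measures on this slide, the size of the largest connected component (LCC) formed by disease proteins, can be illustrated with a short sketch. This is a minimal illustration and not the authors' code: the function names, the toy edge list and the choice of uniform random sampling as the null model are all assumptions made here for clarity.

```python
import random
import statistics
from collections import deque


def build_adjacency(edges):
    """Undirected adjacency map {node: set(neighbors)} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def largest_component_size(adj, nodes):
    """Size of the largest connected component of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj.get(u, ()):
                if v in nodes and v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best


def lcc_z_score(adj, disease_nodes, n_random=1000, seed=0):
    """Compare the observed LCC size with uniformly random node sets of equal size.

    Assumption: uniform sampling is used as the null model for simplicity.
    """
    rng = random.Random(seed)
    observed = largest_component_size(adj, disease_nodes)
    pool = sorted(adj)
    sizes = [
        largest_component_size(adj, rng.sample(pool, len(disease_nodes)))
        for _ in range(n_random)
    ]
    mu, sigma = statistics.mean(sizes), statistics.pstdev(sizes)
    z = (observed - mu) / sigma if sigma > 0 else float("inf")
    return observed, z


# Toy network with hypothetical node names (not real proteins).
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "f"), ("g", "h"), ("i", "j")]
toy_adj = build_adjacency(toy_edges)
observed, z = lcc_z_score(toy_adj, ["a", "b", "c", "e"])
print(observed, round(z, 2))
```

A large positive z-score means the disease proteins form a bigger connected cluster than random node sets of the same size would, which is the sense in which a disease is "localized".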

Page 16: Introduction to Network Medicine

Relations between Diseases

[Figure: interactome neighborhoods of multiple sclerosis (MS), peroxisomal disorders (PD) and rheumatoid arthritis (RA); node labels are gene symbols. Two panels show the distribution P(d) of shortest distances d (d_AA, d_BB or d_AB) for a disease pair: separated modules (MS and PD, s_AB = 1.3, s_AB ≥ 0) and overlapping modules (MS and RA, s_AB = -0.2, s_AB < 0).]

• We introduce a network-based measure to quantify the overlap/separation of two diseases
• Most disease pairs are well separated on the Interactome
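The separation measure can be sketched as follows, assuming the definition used in Menche et al.: s_AB = <d_AB> - (<d_AA> + <d_BB>)/2, where each average runs over the shortest distance from a protein to the nearest protein of the relevant set. The helper names and the toy path graph are hypothetical, and the sketch assumes every protein can reach at least one other protein of each set.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest path length from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def nearest_distances(adj, sources, targets, exclude_self):
    """For each source, the distance to its nearest target (optionally not itself)."""
    values = []
    for u in sources:
        dist = bfs_distances(adj, u)
        candidates = [
            dist[v] for v in targets
            if v in dist and not (exclude_self and v == u)
        ]
        if candidates:
            values.append(min(candidates))
    return values


def separation(adj, set_a, set_b):
    """s_AB = <d_AB> - (<d_AA> + <d_BB>) / 2; negative values indicate overlap."""
    set_a, set_b = set(set_a), set(set_b)
    d_aa = nearest_distances(adj, set_a, set_a, exclude_self=True)
    d_bb = nearest_distances(adj, set_b, set_b, exclude_self=True)
    # Shared proteins contribute a distance of 0 to d_AB.
    d_ab = (nearest_distances(adj, set_a, set_b, exclude_self=False)
            + nearest_distances(adj, set_b, set_a, exclude_self=False))

    def mean(xs):
        return sum(xs) / len(xs)

    return mean(d_ab) - (mean(d_aa) + mean(d_bb)) / 2


# Toy path graph 1-2-3-4-5-6-7-8 (hypothetical nodes).
path_adj = {i: set() for i in range(1, 9)}
for i in range(1, 8):
    path_adj[i].add(i + 1)
    path_adj[i + 1].add(i)

s_far = separation(path_adj, {1, 2}, {7, 8})            # well separated sets
s_overlap = separation(path_adj, {1, 2, 3}, {2, 3, 4})  # sets sharing nodes
print(s_far, s_overlap)
```

On the toy graph the distant sets give a positive separation and the sets that share nodes give a negative one, mirroring the s_AB ≥ 0 and s_AB < 0 panels on the slide.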

Menche et al., Science 2015

Page 17: Introduction to Network Medicine

Network Distance vs. Biomedical Similarity

[Figure: six panels plotting the biomedical similarity of disease pairs against their network separation s_AB (mean values, compared with random expectation). Panels: GO term similarity for biological process, molecular function and cellular component; co-expression; symptom similarity; comorbidity (relative risk RR).]

• Diseases that are close in the Interactome have similar biomedical properties

Page 18: Introduction to Network Medicine

The Disease Space

[Figure: 3D disease space with 14 numbered diseases grouped by class. Immune System Diseases: type 1 diabetes, rheumatoid arthritis, autoimmune diseases of the nervous system, demyelinating autoimmune diseases. Eye Diseases: retinitis pigmentosa, retinal degeneration, Graves disease, macular degeneration. Respiratory Tract Diseases: asthma, respiratory hypersensitivity. Cardiovascular Diseases: cerebrovascular disorders, myocardial infarction, coronary artery disease, myocardial ischemia.]

• Diseases and their network-based relationships can be represented in a 3D disease space
• Diseases belonging to the same class agglomerate

Menche et al., Science 2015

Page 19: Introduction to Network Medicine

Overlapping diseases

• Examples of unexpected disease relationships uncovered using the disease space

[Figure: overlapping disease modules of celiac disease and asthma, with node labels given as gene symbols and the shared pathway "Intestinal immune network for IgA production" highlighted. Disease labels shown: celiac disease, asthma, atherosclerosis, coronary artery disease, biliary tract diseases, hepatic cirrhosis.]

Page 20: Introduction to Network Medicine

ORIGINAL ARTICLE

A disease module in the interactome explains disease heterogeneity, drug response and captures novel pathways and genes in asthma

Amitabh Sharma1,2,3,†, Jörg Menche1,2,4,8,†, C. Chris Huang5, Tatiana Ort5, Xiaobo Zhou3, Maksim Kitsak1,2, Nidhi Sahni2, Derek Thibault3, Linh Voung3, Feng Guo3, Susan Dina Ghiassian1,2, Natali Gulbahce6, Frédéric Baribaud5, Joel Tocker5, Radu Dobrin5, Elliot Barnathan5, Hao Liu5, Reynold A. Panettieri Jr7, Kelan G. Tantisira3, Weiliang Qiu3, Benjamin A. Raby3, Edwin K. Silverman3, Marc Vidal2,9, Scott T. Weiss3 and Albert-László Barabási1,2,3,4,8,*

1Center for Complex Networks Research, Department of Physics, Northeastern University, Boston, MA 02115, USA, 2Center for Cancer Systems Biology (CCSB) and Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, MA 02215, USA, 3Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, USA, 4Department of Theoretical Physics, Budapest University of Technology and Economics, H1111, Budapest, Hungary, 5Janssen Research & Development, Inc., 1400 McKean Road, Spring House, PA 19477, USA, 6Department of Cellular and Molecular Pharmacology, University of California 1700, 4th Street, Byers Hall 308D, San Francisco, CA 94158, USA, 7Pulmonary Allergy and Critical Care Division, Department of Medicine, University of Pennsylvania, 125 South 31st Street, TRL Suite 1200, Philadelphia, PA 19104, USA, 8Center for Network Science, Central European University, Nador u. 9, 1051 Budapest, Hungary and 9Department of Genetics, Harvard Medical School, Boston, MA 02115, USA

*To whom correspondence should be addressed at: Center for Complex Networks Research, Department of Physics, Northeastern University, Boston, MA 02115, USA. Email: [email protected]

Abstract

Recent advances in genetics have spurred rapid progress towards the systematic identification of genes involved in complex diseases. Still, the detailed understanding of the molecular and physiological mechanisms through which these genes affect disease phenotypes remains a major challenge. Here, we identify the asthma disease module, i.e. the local neighborhood of the interactome whose perturbation is associated with asthma, and validate it for functional and pathophysiological relevance, using both computational and experimental approaches. We find that the asthma disease module is enriched with modest GWAS P-values against the background of random variation, and with differentially expressed genes from normal and asthmatic fibroblast cells treated with an asthma-specific drug. The asthma module also contains immune response mechanisms that are shared with other immune-related disease modules. Further, using diverse omics (genomics,

† These authors contributed equally to this work.
Received: September 1, 2014. Revised: November 19, 2014. Accepted: January 5, 2015

© The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

Human Molecular Genetics, 2015, Vol. 24, No. 11 3005–3020

doi: 10.1093/hmg/ddv001
Advance Access Publication Date: 12 January 2015
Original Article


RESEARCH ARTICLE

A DIseAse MOdule Detection (DIAMOnD) Algorithm Derived from a Systematic Analysis of Connectivity Patterns of Disease Proteins in the Human Interactome
Susan Dina Ghiassian1,2☯, Jörg Menche1,2,3☯, Albert-László Barabási1,2,3,4*

1 Center for Complex Networks Research and Department of Physics, Northeastern University, Boston, Massachusetts, United States of America, 2 Center for Cancer Systems Biology (CCSB) and Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts, United States of America, 3 Center for Network Science, Central European University, Budapest, Hungary, 4 Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts, United States of America

☯ These authors contributed equally to this work.
* [email protected]

Abstract

The observation that disease associated proteins often interact with each other has fueled the development of network-based approaches to elucidate the molecular mechanisms of human disease. Such approaches build on the assumption that protein interaction networks can be viewed as maps in which diseases can be identified with localized perturbation within a certain neighborhood. The identification of these neighborhoods, or disease modules, is therefore a prerequisite of a detailed investigation of a particular pathophenotype. While numerous heuristic methods exist that successfully pinpoint disease associated modules, the basic underlying connectivity patterns remain largely unexplored. In this work we aim to fill this gap by analyzing the network properties of a comprehensive corpus of 70 complex diseases. We find that disease associated proteins do not reside within locally dense communities and instead identify connectivity significance as the most predictive quantity. This quantity inspires the design of a novel Disease Module Detection (DIAMOnD) algorithm to identify the full disease module around a set of known disease proteins. We study the performance of the algorithm using well-controlled synthetic data and systematically validate the identified neighborhoods for a large corpus of diseases.

Author Summary

Diseases are rarely the result of an abnormality in a single gene, but involve a whole cascade of interactions between several cellular processes. To disentangle these complex interactions it is necessary to study genotype-phenotype relationships in the context of protein-protein interaction networks. Our analysis of 70 diseases shows that disease proteins are

PLOS Computational Biology | DOI:10.1371/journal.pcbi.1004120 April 8, 2015 1 / 21

OPEN ACCESS

Citation: Ghiassian SD, Menche J, Barabási A-L (2015) A DIseAse MOdule Detection (DIAMOnD) Algorithm Derived from a Systematic Analysis of Connectivity Patterns of Disease Proteins in the Human Interactome. PLoS Comput Biol 11(4): e1004120. doi:10.1371/journal.pcbi.1004120

Editor: Andrey Rzhetsky, University of Chicago,UNITED STATES

Received: August 25, 2014

Accepted: January 9, 2015

Published: April 8, 2015

Copyright: © 2015 Ghiassian et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement: Data and source code for the DIAMOnD algorithm are within the Supporting Information files and can also be downloaded from https://github.com/barabasilab/DIAMOnD. An interactive web-based version of the DIAMOnD algorithm is available at http://diamond.barabasilab.com/.

Funding: This work was funded by National Institutes of Health (NIH) Award #1U01HL108630-01. MAPGen (http://www.mapgenprogram.org/) and NIH, Centers of Excellence of Genomic Science (CEGS), NIH CEGS 1P50HG004233 (http://www.genome.gov/

Sharma et al., HMG 2015

Ghiassian et al., PLoS Comp Biol 2015

BUILDING DISEASE MODULES

Page 21: Introduction to Network Medicine

Disease Module Detection and Analysis

The general workflow of a detailed analysis for a disease of interest:

I. Interactome construction
- Binary interactions, metabolic couplings, regulatory interactions ...

II. Disease module identification
- Seed gene selection: OMIM, GWAS, literature
- DIAMOnD: Disease Module Detection Algorithm

III. Validation
- Gene expression data
- Gene Ontologies
- Pathways
- Comorbidity

IV. Biological interpretation
- Pathway prioritization
- Molecular mechanism

Page 22: Introduction to Network Medicine

Fig 1. Topological properties of disease proteins within the Interactome. (A) Proteins associated with the same phenotype tend to localize in specific neighborhoods of the Interactome, indicating the approximate location of the corresponding disease modules. Topological network communities are highly interconnected groups of nodes. (B) Distribution of the fraction of disease proteins within the largest connected component (LCC) for 70 diseases. Only 10%-30% of the disease proteins are part of the LCC. (C) LCC size of proteins associated with lysosomal storage disease compared to random expectation. Out of 45 disease proteins, 24 (53%) are part of the LCC (z-score = 23.42, empirical p-value < 10^-6). (D) Significance of the LCC sizes as measured by the z-score

DIAMOnD and Disease Modules within the Human Interactome


DISEASE MODULES VS COMMUNITIES

Page 23: Introduction to Network Medicine

DIAMOnD – Disease Module Detection Algorithm

[Figure: (A) Example network with legend: original seed genes, genes connected to a seed gene, and the gene selected at iteration i; each candidate gene is annotated with its p-value (values between 0.07 and 0.53). (B) Growth of the module from the initial seeds through iteration 1, iteration 2 and iteration 3. (C) Disease module: proto-module plus DIAMOnD genes.]

• purely topological method
• all genes in the network are prioritized according to their potential relevance for the disease
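The iterative selection shown on this slide can be sketched in a few lines. This is a simplified illustration and not the published implementation (which is available from the repository cited in the paper): it assumes that the p-value attached to each candidate is a hypergeometric tail probability of having at least the observed number of links to the current module, and the function names, tie-breaking rule and toy graph are hypothetical.

```python
from math import comb


def connectivity_p_value(n_nodes, n_seeds, degree, seed_links):
    """P(X >= seed_links) for X ~ Hypergeometric(N=n_nodes, K=n_seeds, n=degree)."""
    total = comb(n_nodes, degree)
    upper = min(degree, n_seeds)
    tail = sum(
        comb(n_seeds, i) * comb(n_nodes - n_seeds, degree - i)
        for i in range(seed_links, upper + 1)
    )
    return tail / total


def diamond_sketch(adj, seeds, n_iterations):
    """Greedily add the candidate whose links to the module are most significant.

    `adj` maps each node to the set of its neighbors. Returns a ranked list of
    (node, p_value) pairs; ties are broken alphabetically (an assumption).
    """
    module = set(seeds)
    ranked = []
    n_nodes = len(adj)
    for _ in range(n_iterations):
        best = None
        for node, nbrs in adj.items():
            if node in module:
                continue
            links = len(nbrs & module)
            if links == 0:
                continue  # only genes connected to the current module compete
            p = connectivity_p_value(n_nodes, len(module), len(nbrs), links)
            if best is None or (p, node) < best:
                best = (p, node)
        if best is None:
            break
        ranked.append((best[1], best[0]))
        module.add(best[1])  # the selected gene acts as a seed in later iterations
    return ranked


# Toy network: "x" has 2 of 2 links to seeds, hub "y" has 1 of 5 (hypothetical names).
toy = {
    "s1": {"x", "y"}, "s2": {"x"}, "x": {"s1", "s2"},
    "y": {"s1", "a", "b", "c", "d"},
    "a": {"y"}, "b": {"y"}, "c": {"y"}, "d": {"y"},
}
ranking = diamond_sketch(toy, {"s1", "s2"}, n_iterations=2)
print(ranking)
```

The toy example shows why the method is described as purely topological: the low-degree node with all of its links pointing at seeds is ranked ahead of the hub, even though both touch the seed set.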

Page 24: Introduction to Network Medicine

sensitively on the initial level of completeness (Fig. 3C-F). Hence, the true positive rate can be estimated by removing varying fractions of seed proteins. For lysosomal storage disorders, for example we find an estimated recall of ~50% at iteration 40 (Fig. 3E). After 40 iterations, the recall saturates and reaches a plateau, indicating that thereafter only few DIAMOnD proteins are expected to be truly disease associated. This saturation point may therefore be used as a threshold for the total number of DIAMOnD genes to consider.

(ii) A biological criterion for the threshold can be obtained from the validation according to Fig. 4A,B. The number of DIAMOnD proteins with direct biological evidence reaches a plateau at ~200 iteration steps, suggesting this as the maximal number that should be considered. A more stringent criterion is to use the significance of the enrichment (see Materials & Methods). The enrichment is typically strongest within the highest ranked DIAMOnD proteins and decreases with increasing iteration steps. For lysosomal storage diseases, for example, we find that the first 200 DIAMOnD proteins are similarly significantly enriched as the seed proteins (Fig. 4B). The largest connected component of the seed proteins alone consists of 24 (out of 45) proteins. When 200 DIAMOnD proteins are added, the largest connected component of the resulting module integrates 11 additional, previously disconnected seed proteins, resulting in a module consisting of 234 proteins (Fig. 4C). Fig. 4F shows the distribution of the fraction of integrated seed proteins across 70 diseases for several iterations. We find that with increasing number of DIAMOnD genes more and more disconnected seed proteins are integrated into the module, thus allowing for an integrated analysis of their molecular mechanism.

Fig 4. Biological evaluation of DIAMOnD. (A) Validation of the DIAMOnD genes based on Gene Ontology terms (see Materials & Methods). (B) The significance of the similarity between DIAMOnD genes and seed genes suggests a cutoff of ~200 DIAMOnD genes. (C) Network representation of the lysosomal storage diseases module. (D,E) Summary of the validation for all 70 disease modules based on Gene Ontology (D) and biological pathways (E). (F) Fraction of seed proteins that are contained in the LCC of the DIAMOnD module for varying iteration steps. The distributions show the values obtained from 70 diseases. By introducing DIAMOnD proteins, previously disconnected seed proteins become part of the LCC.

doi:10.1371/journal.pcbi.1004120.g004


BUILDING A DISEASE MODULE

Page 25: Introduction to Network Medicine

ARTICLE
Received 7 May 2015 | Accepted 29 Nov 2015 | Published 1 Feb 2016

Network-based in silico drug efficacy screening
Emre Guney1,2, Jorg Menche1,3, Marc Vidal2,4 & Albert-Laszlo Barabasi1,2,3,5

The increasing cost of drug development together with a significant drop in the number of new drug approvals raises the need for innovative approaches for target identification and efficacy prediction. Here, we take advantage of our increasing understanding of the network-based origins of diseases to introduce a drug-disease proximity measure that quantifies the interplay between drug targets and diseases. By correcting for the known biases of the interactome, proximity helps us uncover the therapeutic effect of drugs, as well as to distinguish palliative from effective treatments. Our analysis of 238 drugs used in 78 diseases indicates that the therapeutic effect of drugs is localized in a small network neighborhood of the disease genes and highlights efficacy issues for drugs used in Parkinson and several inflammatory disorders. Finally, network-based proximity allows us to predict novel drug-disease associations that offer unprecedented opportunities for drug repurposing and the detection of adverse effects.

DOI: 10.1038/ncomms10331 OPEN

1 Center for Complex Networks Research (CCNR) and Department of Physics, Northeastern University, 177 Huntington Avenue, 11th floor, Boston, Massachusetts 02115, USA. 2 Center for Cancer Systems Biology (CCSB) and Department of Cancer Biology, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, Massachusetts 02215, USA. 3 Center for Network Science, Central European University, Nador utca 9, 1051 Budapest, Hungary. 4 Department of Genetics, Harvard Medical School, 77 Avenue Louis Pasteur, Boston, Massachusetts 02115, USA. 5 Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, 75 Francis Street, Boston, Massachusetts 02115, USA. Correspondence and requests for materials should be addressed to A.-L.B. (email: [email protected]).

NATURE COMMUNICATIONS | 7:10331 | DOI: 10.1038/ncomms10331 | www.nature.com/naturecommunications 1

Guney et al., Nature Comm 2015

DRUGS

Page 26: Introduction to Network Medicine

DRUGS

a drug-target coincides with a disease protein. On the other hand, in 490 of 18,162 unknown drug-disease pairs (2.7%) the drug targets are known disease proteins, but not associated with the drug’s actual disease indication. Although in both classes (known and unknown), the overlap between drug targets and disease proteins is low, the much higher ratio among known drug-disease associations (Fisher’s exact test, odds ratio = 6.6, two-sided P = 5.2 × 10^-27) suggests that direct targeting of known disease proteins is a rare but important therapeutic component in disease treatment.

Drugs target the local neighborhood of the disease proteins. We first test how well relative proximity discriminates the 402 known drug-disease pairs from the 18,162 unknown drug-disease pairs by comparing the area under Receiver Operating Characteristic (ROC) curve (AUC, Methods section) for different distance measures. In addition to the closest (dc) and shortest (ds) measures discussed above, we measure relative proximity between a drug and a disease using three other network-based distance measures: (i) the kernel measure, dk, which downweights longer

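[Figure 1 panels. (a) Drug-disease proximity: for drug targets T = {t1, t2} and disease genes S = {s1, s2, s3}, the distance d averages, over the drug targets, the shortest path to the closest disease gene (illustrated example: d = (2+3)/2). The relative proximity is $z = (d - \mu_R)/\sigma_R$, where $\mu_R$ and $\sigma_R$ are the mean and standard deviation of the distances d1 ... dn obtained for random gene sets (T1, S1) ... (Tn, Sn) with the same degrees. (b) Subnetwork of disease genes and drug targets for gliclazide (type 2 diabetes) and daunorubicin (acute myeloid leukaemia); labeled genes: ABCC8, VEGFA, RUNX1, INS, KAT6A, TOP2A, IRS1, TOP2B, CAPN10, NPM1. (c) Values of dc and zc for the four drug-disease combinations (pairing not recoverable from the transcript): dc = 2.5, 1.0, 2.0, 1.0; zc = 1.3, –1.6, 1.0, –3.3.]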

Figure 1 | Network-based drug-disease proximity. (a) Illustration of the closest distance (dc) of a drug T with targets t1 and t2 to the proteins s1, s2 and s3 associated with disease S. To measure the relative proximity (zc), we compare the distance dc between T and S to a reference distribution of distances observed if the drug targets and disease proteins are randomly chosen from the interactome. The obtained proximity zc quantifies whether a particular dc is smaller than expected by chance. To account for the heterogeneous degree distribution of the interactome and differences in the number of drug targets and disease proteins, we preserve the number and degrees of the randomized targets and disease proteins. (b) The shortest paths between drug targets and disease proteins for two known drug-disease associations: gliclazide, a T2D drug with two targets, and daunorubicin, a drug used for AML that also has two targets in the interactome. The subnetwork shows the shortest paths connecting each drug target to the nearest disease proteins. Proteins are coloured with respect to the disease they are associated with: T2D (blue) and AML (red). Drug targets are represented as triangles and coloured according to whether they are targets of gliclazide (light blue) or daunorubicin (brown). Blue and red links illustrate the shortest path from the drug targets to the nearest disease proteins (of T2D and AML, respectively). Node size scales with the degree of the node within the subnetwork. In case of multiple disease proteins with equal shortest path lengths to the target, the disease protein with the lowest degree in the interactome is shown. (c) The proximity zc of gliclazide and daunorubicin to T2D and AML, indicating low zc for the recommended use of these drugs and high zc for their non-recommended use.
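The procedure in panel (a) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy graph and node names are hypothetical, the graph is assumed connected, and degree preservation is simplified to sampling nodes of exactly the same degree (how the authors grouped degrees is not shown in this excerpt). The function names `closest_distance` and `proximity_z` are made up for the example.

```python
import random
import statistics
from collections import defaultdict, deque


def bfs_distances(graph, source):
    """Shortest path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def closest_distance(graph, targets, disease):
    """d_c: mean over targets of the distance to the nearest disease protein."""
    total = 0
    for t in targets:
        dist = bfs_distances(graph, t)
        total += min(dist[s] for s in disease if s in dist)
    return total / len(targets)


def degree_matched_sample(graph, nodes, by_degree, rng):
    """Swap each node for a random, not yet chosen node of the same degree."""
    chosen = set()
    for n in nodes:
        pool = [m for m in by_degree[len(graph[n])] if m not in chosen]
        chosen.add(rng.choice(pool))
    return chosen


def proximity_z(graph, targets, disease, n_random=1000, seed=0):
    """Return (d_c, z_c), with z_c = (d_c - mean) / std over random gene sets."""
    rng = random.Random(seed)
    by_degree = defaultdict(list)
    for n in sorted(graph):
        by_degree[len(graph[n])].append(n)
    d = closest_distance(graph, targets, disease)
    reference = [
        closest_distance(
            graph,
            degree_matched_sample(graph, targets, by_degree, rng),
            degree_matched_sample(graph, disease, by_degree, rng),
        )
        for _ in range(n_random)
    ]
    mu = statistics.mean(reference)
    sigma = statistics.stdev(reference)
    return d, (d - mu) / sigma


# Hypothetical toy interactome: t1 is 2 hops and t2 is 3 hops from the
# nearest disease protein, as in the panel (a) example d = (2 + 3)/2.
edges = [("t1", "a"), ("a", "s1"), ("s1", "s2"), ("s2", "s3"),
         ("t2", "b"), ("b", "c"), ("c", "s3"), ("a", "b")]
graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

d, z = proximity_z(graph, ["t1", "t2"], ["s1", "s2", "s3"])
print(d)  # 2.5
```

On a graph this small the z value carries little meaning, since only a handful of degree-matched gene sets exist; the sketch is meant to show the mechanics of comparing dc with a degree-preserving reference distribution.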


Page 27: Introduction to Network Medicine

PROXIMITY TO DISEASE MODULES

all of the drug targets are disease proteins, all the drugs are proximal to the disease with the only exception of disopyramide, a cardiac arrhythmia drug (Fig. 3). In 176 of the remaining 340 known drug-disease associations for which the drug targets do not coincide with any of the disease proteins, the drug targets are proximal to the disease, indicating that the interactome can highlight non-obvious drug-disease associations in which the drug does not directly target known disease proteins.

[Figure 2, panels a to e. (a) Schematic of a drug and a disease module with the five distance measures, and AUC (%) for Closest (dc), Shortest (ds), Kernel (dk), Center (dcc) and Separation (dss). (b) Average closest path length (dc) versus Degree (k), R2 = 0.175. (c) Proximity (zc) versus Degree (k), R2 = 0.003. (d) Drug-drug similarity-based classification: AUC and Coverage (%) for Target proximity, Target PPI, Target, Chemical, GO, LINCS and Side effect. (e) Number of drug-disease pairs (Proximal versus Distant) for Known, Unknown, In trials and Not in trials; P = 5.1 × 10^−14 and P = 4.5 × 10^−9.]

Figure 2 | Validating drug-disease proximity. (a) AUC is shown for relative proximity, z, calculated using five different distance measures. The closest measure, dc, considers the shortest path length from each target to the closest disease protein; the shortest measure, ds, averages over all shortest path lengths to the disease proteins. See the text for the definition of the kernel (dk), centre (dcc) and separation (dss) measures. (b) Average shortest path length between drug targets and disease proteins versus average drug-target degree for known drug-disease pairs. (c) Drug-disease proximity versus average degree of drug targets for known drug-disease pairs. (d) The plot shows AUC and coverage values for drug similarity-based measures based on the relative proximity between the targets (target proximity), the interactome-based distance between the targets (target PPI), sharing drug targets (target), chemical similarity (chemical), GO terms shared among the targets (GO), common differentially regulated genes in the perturbation profiles of the two drugs in the LINCS database (LINCS), and common side effects (side effect). Coverage is defined as the percentage of drug-disease associations for which the method can make predictions. (e) Number of proximal and distant drug-disease pairs among known and unknown drug-disease associations (Fisher’s exact test, odds ratio = 2.1, P = 5.1 × 10^−14). The unknown drug-disease associations are further categorized based on whether the association is in clinical trials (in trials) or not (not in trials; Fisher’s exact test, odds ratio = 1.6, P = 4.5 × 10^−9).
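The two quantities compared in panels (a) and (d), AUC and coverage, can be illustrated with a small sketch. The scores below are hypothetical toy values, not data from the paper. The AUC is computed here as the probability that a known pair has a lower (more proximal) z than an unknown pair, with ties counted as one half; that equals the area under the ROC curve when a lower z is read as a positive prediction. The function names are made up for the example.

```python
def auc_lower_is_positive(known_scores, unknown_scores):
    """P(known pair scores lower than unknown pair); ties count as 1/2."""
    wins = 0.0
    for k in known_scores:
        for u in unknown_scores:
            if k < u:
                wins += 1.0
            elif k == u:
                wins += 0.5
    return wins / (len(known_scores) * len(unknown_scores))


def coverage(scores):
    """Percentage of pairs with a prediction (None means no prediction)."""
    scored = [s for s in scores if s is not None]
    return 100.0 * len(scored) / len(scores)


# Hypothetical proximity values for three known and four unknown pairs.
known_z = [-3.0, -1.5, 0.5]
unknown_z = [1.0, 1.5, -2.0, 0.5]

auc = auc_lower_is_positive(known_z, unknown_z)
print(f"AUC = {auc:.2f}")                       # AUC = 0.79
print(coverage([-3.0, None, 0.5, 1.0]))         # 75.0
```

A method can reach a high AUC and still be of limited use if its coverage is low, which is why panel (d) reports both values side by side.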


Page 28: Introduction to Network Medicine

GOING FURTHER

Full text: http://barabasi.com/networksciencebook/