Ab initio genotype–phenotype association reveals intrinsic modularity in genetic networks

Noam Slonim1,3, Olivier Elemento2,3 and Saeed Tavazoie2,*

1 Department of Physics, Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA and 2 Department of Molecular Biology, Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
* Corresponding author. Department of Molecular Biology, Lewis-Sigler Institute for Integrative Genomics, Princeton University, Washington Street, Carl Icahn, Room 245, Princeton, NJ 08544, USA. Tel.: +1 609 258 0331; Fax: +1 609 258 3565; E-mail: [email protected]
3 These authors contributed equally to this work

Received 30.8.05; accepted 5.12.05

Microbial species express an astonishing diversity of phenotypic traits, behaviors, and metabolic capacities. However, our molecular understanding of these phenotypes is based almost entirely on studies in a handful of model organisms that together represent only a small fraction of this phenotypic diversity. Furthermore, many microbial species are not amenable to traditional laboratory analysis because of their exotic lifestyles and/or lack of suitable molecular genetic techniques. As an adjunct to experimental analysis, we have developed a computational information-theoretic framework that produces high-confidence gene–phenotype predictions using cross-species distributions of genes and phenotypes across 202 fully sequenced archaea and eubacteria. In addition to identifying the genetic basis of complex traits, our approach reveals the organization of these genes into generic preferentially co-inherited modules, many of which correspond directly to known enzymatic pathways, molecular complexes, signaling pathways, and molecular machines.

Molecular Systems Biology 31 January 2006; doi:10.1038/msb4100047
Subject Categories: computational methods; microbiology & pathogens
Keywords: comparative genomics; genotype–phenotype association; information theory; microbiology; modularity

Introduction

Since the time of Gregor Mendel, a central focus of biology has been to understand the hereditary basis of organismal traits and their variation. The field of genetics approached this problem by identifying heritable variation in a phenotype of interest within a single species. Using recombination, the genetic basis of this variation could be mapped to individual genes. This simple approach, coupled with biochemical and cell biological analysis of gene products and their interactions, is at the core of our modern molecular understanding of life (Alberts et al, 1994).

Today, one might pose the same genotype–phenotype question in a different way: can we understand the genetic basis of a trait by differential inheritance of genetic elements across many species that show variation in the expression of that trait? Several recent works have used the availability of complete microbial genomes, along with their phenotype annotations, to demonstrate the feasibility of such a program (Huynen et al, 1998; Levesque et al, 2003; Makarova et al, 2003; Jim et al, 2004; Korbel et al, 2005). In all these works, the basic output is essentially a list of genes that are predicted to be associated with a particular trait. However, the expression of a complex phenotype often involves the coordinated activity of multiple functional modules, such as signaling pathways or molecular complexes. Here, we propose an information-theoretic computational approach that uses complete genome sequences and their phenotypic annotations to recover this rich structure.

We pose a simple question: are there generic modules that are preferentially inherited for the expression of a specific microbial trait? The common ancestry of extant species, and widespread horizontal gene transfer among prokaryotes (Lerat et al, 2005), would facilitate the sharing of such modules. In fact, one would expect that once optimized, such modules would be preferentially employed for expressing common phenotypic traits, and that the statistical signature of this differential coinheritance would be detectable across a large enough number of diverse species. Here, we have applied our approach to a set of 202 complete microbial genomes, focusing on diverse phenotypic characteristics such as behavior, cellular morphology, physiological capacity, cellular differentiation, and pathogenicity. In all these cases, our approach identifies the known genes involved, and reveals their organization into robust modules. Our observations support the notion that modularity in molecular networks is a native property of biological systems, and not an artifact of biases historically imposed by biologists (Hartwell et al, 1999).

Molecular Systems Biology (2006) doi:10.1038/msb4100047
© 2006 EMBO and Nature Publishing Group. All rights reserved 1744-4292/06
www.molecularsystemsbiology.com
2006.0005

Results

Recovering generic gene (GG) modules underlying observed microbial traits

Given the sequenced genomes of many species, a gene can be described by a vector of ones and zeros, indicating the presence or absence of homologs of this gene in each of the available genomes. This representation, first suggested by Pellegrini et al (1999), is often referred to as the gene phylogenetic profile. Certain phenotypes like motility in bacteria can be described by similar vectors, indicating the presence (e.g., motile) or absence (non-motile) of this phenotype across the same set of species. The underlying assumption in our study is that genes whose phylogenetic profiles closely correlate with the phenotypic descriptions are likely to be involved in some aspects of the corresponding phenotypes. For example, it is natural to expect that the phylogenetic profile of a gene encoding a flagellar apparatus component will be positively correlated with the pattern of motility/non-motility across many bacterial genomes.
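The profile representation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the genome names, gene-family names and genome contents are hypothetical toy data, and the simple agreement score stands in for the mutual-information measure that the study actually uses.

```python
# Hypothetical toy data: which gene families have a homolog in each genome.
# Genome and family names are illustrative, not taken from the paper's dataset.
GENOME_CONTENT = {
    "genome_A": {"flagellar_fam", "ribosomal_fam"},
    "genome_B": {"flagellar_fam", "ribosomal_fam"},
    "genome_C": {"ribosomal_fam"},
    "genome_D": {"ribosomal_fam"},
}
GENOMES = sorted(GENOME_CONTENT)

# Phenotype vector over the same genomes (1 = motile, 0 = non-motile).
MOTILITY = {"genome_A": 1, "genome_B": 1, "genome_C": 0, "genome_D": 0}


def phylogenetic_profile(family, genomes, content):
    """Return the 0/1 presence vector of a gene family across genomes."""
    return [1 if family in content[g] else 0 for g in genomes]


def agreement(profile, phenotype):
    """Fraction of genomes where gene presence equals phenotype presence."""
    matches = sum(1 for a, b in zip(profile, phenotype) if a == b)
    return matches / len(profile)


phenotype_vector = [MOTILITY[g] for g in GENOMES]
flagellar = phylogenetic_profile("flagellar_fam", GENOMES, GENOME_CONTENT)
ribosomal = phylogenetic_profile("ribosomal_fam", GENOMES, GENOME_CONTENT)
```

In this toy setting the flagellar family tracks the motility vector exactly, whereas the family found in every genome agrees with the phenotype in only half of the genomes and so carries no information about it.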

Our approach can be summarized as follows (Figure 1). We created phylogenetic profiles for all the ~600 000 genes in 202 fully sequenced prokaryotic genomes. Next, we estimated the statistical correlation between each gene and a given phenotype through their mutual information (Cover and Thomas, 1991). We then collected only genes with a statistically significant correlation with the phenotype, from all the organisms having the phenotype, and merged them into a single cross-genome list. Within this list, we identified groups of homologs, termed here phenotype generic genes (GGs), that are predicted to constitute the genetic basis of the examined phenotype. Finally, we found robust modules among these phenotype GGs with phylogenetic profiles that are highly informative about each other. Specifically, each module corresponds to a set of GGs that are consistently placed by a clustering algorithm in the same cluster, in numerous runs with different initial conditions (see Materials and methods). In other words, these are generic modules, required in a relatively conserved form by most microbes expressing the phenotype, while typically unnecessary for the remaining organisms. As a concrete example, we first present our results for a well-studied bacterial behavior, that of motility. Then, we detail our results for phenotypes like Gram-negativity, endospore formation, oxygen respiration, and intracellular pathogenicity. The complete data, results, and relevant software are available at our Web site (http://tavazoielab.princeton.edu/genphen/).
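The gene–phenotype scoring step can be illustrated with a plug-in estimate of mutual information from a 2×2 table of presence/absence counts. This is a minimal sketch, not the estimator of the study: the function name and argument order are assumptions made for illustration, and the estimator and significance test actually used are specified in Materials and methods, which may differ (for example by correcting for finite sample size).

```python
from math import log2


def mutual_information_bits(n00, n10, n01, n11):
    """Plug-in mutual information (in bits) from a 2x2 count table.

    n00: organisms without the phenotype and without the gene
    n10: organisms with the phenotype but without the gene
    n01: organisms without the phenotype but with the gene
    n11: organisms with both the phenotype and the gene
    """
    total = n00 + n10 + n01 + n11
    # Keys are (phenotype, gene) indicator pairs.
    joint = {(0, 0): n00, (1, 0): n10, (0, 1): n01, (1, 1): n11}
    p_marg = {p: sum(c for (pp, _), c in joint.items() if pp == p) / total
              for p in (0, 1)}
    g_marg = {g: sum(c for (_, gg), c in joint.items() if gg == g) / total
              for g in (0, 1)}
    mi = 0.0
    for (p, g), count in joint.items():
        if count == 0:
            continue  # empty cells contribute nothing (0 * log 0 = 0)
        p_joint = count / total
        mi += p_joint * log2(p_joint / (p_marg[p] * g_marg[g]))
    return mi
```

A gene present in every organism, such as the fliI counts (0, 0, 53, 93) listed in Figure 3B, scores exactly zero under this estimate, in line with the 0.00 bits reported there. For the fliP counts (50, 22, 3, 71) the plug-in value is close to 0.38 bits, slightly above the 0.36 bits in Figure 3B; the gap presumably reflects a different estimator, which this sketch does not attempt to reproduce.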

Behavior: motility

Our cross-genome list for motility consists of 3833 genes, collected from 92 motile eubacteria and archaea, based on their significant positive correlation with this phenotype. These genes correspond to only 75 groups of homologs, or motility GGs. These 75 motility GGs were then clustered based on their phylogenetic profiles, yielding 14 robust modules (see Materials and methods).

In Figure 2A, we present the phylogenetic profiles of the five of these 14 modules that have the maximal average correlation with the motility phenotype. The rows correspond to the motility GGs, the columns to the different organisms in our data, and each entry indicates whether a particular GG is represented in the genome of a specific organism. Rows in Figure 2A are organized according to the obtained modules, and columns are grouped based on broad phylogenetic classifications. As seen in Figure 2A, the observed motility GGs induce a remarkably accurate dichotomy between motile and non-motile organisms. Note that in a few cases, we detect many motility-related genes in the genomes of non-motile organisms (e.g., for Burkholderia mallei), in agreement with previous studies (Nierman et al, 2004).

Our method uncovers the main modules underlying microbial motility with impressive precision. The first two modules consist solely of genes encoding bacterial flagella components; the third module corresponds to the chemotaxis pathway (receptors and signal transduction); the next module consists of two GGs associated with the flagella outer pair of rings (L- and P-ring) that are present only in Gram-negative bacteria, and support the proximal rod through the outer membrane (OM). Finally, the fifth module includes σ54, a known regulator of nitrogen metabolism in Escherichia coli that also regulates motility genes in other species (Jagannathan et al, 2001; Wolfe et al, 2004). The second motility GG in this module is a σ54-dependent transcriptional regulator, suggesting its potential role in motility.

Figure 1 A schematic overview of the approach.

In Figure 2B, we further explore the relations between these five modules. Every entry in this matrix indicates the probability that two GGs are placed in the same cluster by the clustering algorithm (see Materials and methods). Evidently, the distinction between the two main flagellar modules is rather weak, and is likely due to the fact that the second module is less dominant in α-proteobacteria; however, the distinction between the other modules is extremely sharp. For example, a chemotaxis GG and a flagellar GG almost never end up in the same cluster. Supplementary Figures S1 and S2 present a similar analysis for all 14 modules obtained by our approach.
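The joint-assignment probability discussed here can be sketched as the fraction of clustering runs in which two GGs receive the same cluster label. The snippet below assumes the runs have already been produced by some clustering algorithm; the three run labelings are made-up toy data, and the greedy grouping rule and its threshold are illustrative assumptions, not the procedure given in Materials and methods.

```python
from itertools import combinations

# Hypothetical cluster labels from three clustering runs with different
# initial conditions (item -> cluster label); values are invented.
RUNS = [
    {"flhA": 0, "flhB": 0, "cheA": 1, "cheW": 1},
    {"flhA": 1, "flhB": 1, "cheA": 0, "cheW": 0},
    {"flhA": 0, "flhB": 0, "cheA": 1, "cheW": 0},
]


def joint_assignment(runs):
    """Map each sorted pair of items to the fraction of runs that put
    both items in the same cluster."""
    items = sorted(runs[0])
    prob = {}
    for a, b in combinations(items, 2):
        together = sum(1 for run in runs if run[a] == run[b])
        prob[(a, b)] = together / len(runs)
    return prob


def robust_modules(runs, threshold=0.9):
    """Greedily group items whose joint-assignment probability with every
    current member of a module is at least the threshold."""
    prob = joint_assignment(runs)
    modules = []
    for item in sorted(runs[0]):
        for module in modules:
            if all(prob[tuple(sorted((item, other)))] >= threshold
                   for other in module):
                module.append(item)
                break
        else:
            modules.append([item])
    return modules
```

Cluster labels are compared only within a run, so the arbitrary renumbering of clusters between runs (as between the first two toy runs) does not affect the result. With the toy data, the two flagellar items are always co-assigned, while the two chemotaxis items are split in one run and therefore fall into separate modules at a strict threshold.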

To further validate our results, in Figure 3A, we illustrate the E. coli chemotactic pathway and flagellar apparatus (Kanehisa and Goto, 2000); our approach recovers most of the genes in the E. coli motility system, and moreover partitions these genes into biologically meaningful modules. The genes that are not detected by our procedure correspond to two opposite scenarios: some genes are too specific, while others are too abundant (Figure 3B). For example, fliO is too specific, as we detect its variants only in E. coli and seven other closely related species. On the other hand, fliI is too abundant, as we find fliI variants in motile as well as non-motile microbes (Dreyfus et al, 1993). Interestingly, all three scenarios (detected, too specific, and too abundant) can occur within the same operon, as fliP (motility GG-5), fliO, and fliI all reside next to each other in the E. coli genome. In addition, the E. coli flagella-related chaperones (FliT, FliJ, FlgN) and transcription factors (FlhC, FlhD, FlgM) are all too specific, present only in some of the γ/β-proteobacteria in our data. Within the chemotaxis system, cheY is too abundant while cheZ is too specific. Indeed, it was previously suggested that in the absence of CheZ, the CheV protein (motility GG-38) could fulfill a similar function (Pittman et al, 2001).

Morphology: Gram-negativity

Gram-negative bacteria are in general characterized by the presence of an additional membrane layer, the OM, that serves as a permeability barrier to prevent the entry of toxic compounds while allowing the influx of nutrient molecules (Nikaido, 2003). The biogenesis of the OM is only partially understood, and methods to probe the assembly process are only starting to emerge (Ruiz et al, 2005; Wu et al, 2005). Our approach provides an appealing alternative for gaining novel insights regarding this phenotype.

Our cross-genome list for Gram-negativity consists of 4678 genes collected from 105 Gram-negative bacteria based on their significant correlation with this phenotype. These genes correspond to 117 groups of homologs, or Gram-negative GGs. Cluster analysis of these 117 GGs yields 19 robust modules (Supplementary Figures S3 and S4), four of which are presented in Figure 4A. As for the motility phenotype, the Gram-negative GGs induce a pronounced dichotomy between Gram-negative and Gram-positive bacteria.

The first two modules in Figure 4A have the maximum average correlation with the Gram-negative phenotype, and are strongly related to each other (Supplementary Figure S4). These two modules are present in almost all Gram-negative bacteria in our data, with the notable exception of Mycoplasma genomes; indeed, Mycoplasma stain Gram-negative but lack a cell wall and are therefore expected to have special genetic characteristics with respect to this phenotype (Hegermann et al, 2002). In addition, both modules are absent from a few intracellular proteobacteria, such as Buchnera, consistent with previous reports that detected very few genes encoding lipoproteins and OM proteins in these species (Shigenobu et al, 2000). The first module encapsulates most of the lipid-A biosynthesis pathway. Lipid-A is the hydrophobic anchor of lipopolysaccharide (LPS), which is present in high abundance in the OM of Gram-negative bacteria. As shown in Figure 4B, this module captures many of the components of the lipid-A biosynthesis pathway in E. coli. The only genes that are not detected by our analysis are lpxH and kdsA: the first is too specific, found mainly in γ/β-proteobacteria, while the latter is too abundant, that is, a variant of this gene is present in many Gram-positive bacteria (Figure 4C). Another member in this first module is GG-1, which corresponds to the yaeT gene from E. coli. This gene was recently found to be involved in OM assembly (Wu et al, 2005); moreover, the homolog of yaeT in Neisseria meningitidis, also in GG-1, was found to be required for LPS and phospholipid transport to the OM (Genevrois et al, 2003). This further reinforces our prediction that GG-1 genes are functionally related to the lipid-A biosynthesis pathway in many organisms.

The second module includes several components (GG-9, GG-10, GG-12) associated with a single, and apparently quite generic, system involved in stabilizing the OM organization in an energy-dependent manner (Nikaido, 2003). The third module captures several members of the type I export pathway of proteins across the OM, for example, GG-20, GG-53, GG-96 and GG-111 (Nikaido, 2003). In the fourth module, we find mainly GGs that are glutaredoxin and glutathione related. Glutaredoxins are a family of proteins that catalyze a variety of redox reactions, particularly the reduction of protein disulfides. They are reduced by glutathione that is found primarily in Gram-negative bacteria (Copley and Dhillon, 2002). Both glutaredoxin and glutathione are involved in the resistance of Gram-negative cells to arsenic, and possibly also in the cellular response to oxidative stress (Berardi and Bushweller, 1999). Interestingly, this entire module is typically absent from δ/ε-proteobacteria.

Physiology: oxygen respiration

Oxygen respiration is a fundamentally important bioenergetic process in bacteria. Microorganisms respond differently to oxygen; for strict aerobes, it is essential, while for strict anaerobes, it is toxic; facultative microbes can grow either in the presence or absence of oxygen. Below, we discuss the application of our method to these three phenotypic classes.

Our cross-genome list for the aerobic phenotype consists of 2828 genes collected from 71 strictly aerobic organisms; these 2828 genes correspond to 130 groups of homologs, or aerobic GGs, that were clustered as before (Supplementary Figures S5 and S6). Two of the resulting modules are presented in the upper part of Figure 5. The first module captures almost perfectly the NADH dehydrogenase I complex, a common component of the electron transport chain. Specifically, among the 14 E. coli genes that participate in this complex, only two are missing from this module; nuoG is too abundant, while nuoJ is in a different module. Unlike previously considered phenotypes, the members of this module, while highly enriched in aerobic organisms, also show significant presence in the facultative and anaerobic species. On the other hand, the second module includes several GGs associated with the cytochrome C-oxidase complex, which are typically absent from anaerobic species, since this complex carries out the final step in O2 respiration.

[Figure 2 graphic, panels A and B. Row labels: GG-3 flagellar biosynthetic protein flhB; GG-4 flagellar biosynthetic protein flhA; GG-5 flagellar biosynthetic protein fliP; GG-22 flagellar biosynthetic protein fliR; GG-56 flagellar biosynthetic protein fliQ; GG-6 flagellar hook flgE/F/G; GG-7 flagellar motor switch fliG; GG-10 flagellar basal-body rod flgC; GG-12 flagellar MS-ring fliF; GG-13 flagellar hook-associated protein 1 flgK; GG-18 flagellar motor switch fliN; GG-21 flagellar motor switch fliM; GG-27 flagellar hook-associated protein 3 flgL; GG-29 flagellar hook-associated protein 2 fliD; GG-8 flagellin fliC; GG-17 motility protein A motA; GG-74 flagellar protein fliS; GG-20 motility protein B motB; GG-1 methyl-accepting chemotaxis protein; GG-11 chemotaxis protein cheA; GG-45 methyl-accepting chemotaxis protein; GG-73 methyl-accepting chemotaxis protein; GG-38 chemotaxis protein cheV; GG-15 chemotaxis protein cheW; GG-2 chemotaxis methyltransferase cheR; GG-30 glutamate methylesterase cheB; GG-32 flagellar L-ring protein precursor flgH; GG-36 flagellar P-ring protein precursor flgI; GG-9 RNA-polymerase sigma-54 factor; GG-14 transcription factor, sigma-54 dependent. Columns: motile and non-motile organisms, grouped as γ-Proteobacteria, α-Proteobacteria, β-Proteobacteria, δ/ε-Proteobacteria, Firmicutes, Bacteria-miscellaneous, Archaea and Actinobacteria. Correlation bars scaled from 0.0 to 0.5 bits; average joint-assignment probabilities shown for the five modules: 0.95, 0.98, 0.99, 1.00, 0.95; panel B color scale from 0.1 to 0.9.]

Similar analysis of the anaerobic phenotype yields 20 modules (Supplementary Figures S7 and S8), three of which are presented in the lower part of Figure 5. The first includes four GGs, all associated with ferredoxin oxidoreductase activity; the second consists of three subunits of V-type sodium ATP synthase; and the third includes two iron-sulfur flavoprotein GGs, which are present almost exclusively in strict anaerobes.

The analysis of the facultative phenotype yields 22 modules (Supplementary Figures S9 and S10), two of which, with a perfect average joint-assignment probability of 1.0, are presented in the middle part of Figure 5. The first module corresponds to specific components in a phosphotransferase system (PTS), involved in uptake and phosphorylation of the carbohydrates cellobiose and lactose. The second module corresponds to a different complex in the PTS system, involved in the uptake of other carbohydrates like galactosamine, mannose, and fructose. Several other modules of facultative GGs are also enriched in PTS system components (Supplementary Figure S9). The lack of any known mechanistic link between PTS-mediated carbohydrate transport and the examined facultative phenotype may reflect an indirect relationship. For example, it may correspond to the observed tendency of facultative bacteria in our data to inhabit carbohydrate-rich environments within multicellular hosts.

Cellular differentiation: endospore formation

Some bacteria are capable of producing endospores, resting structures that are formed by an unusual asymmetric cell division, followed by engulfment of the smaller cell (the prespore) by the mother cell (see Errington, 2003, for a review). Endospores are the most resistant biological structures known; they can survive high temperatures, radiation, and chemical solvents, and can remain dormant for many years (Vreeland et al, 2000). Only 17 bacteria in our data were annotated as capable of forming endospores, most of them from the genus Bacillus. Nevertheless, our analysis reveals

Figure 2 Results for motility GGs. (A) Phylogenetic profiles of the five most informative modules. Rows correspond to motility GGs and columns to organisms; each entry indicates whether a particular motility GG is represented in the genome of a specific organism (red for motile and yellow for non-motile); the green bars indicate the correlation (in bits) between every motility GG and the motility phenotype; the average joint-assignment probability in each module is specified on the right (see Materials and methods). (B) Every entry in this matrix indicates the probability of two motility GGs to be placed in the same cluster by the clustering algorithm, that is, their joint-assignment probability.

[Figure 3, panel A (diagram labels). Flagellar apparatus, spanning the outer membrane, peptidoglycan, and inner membrane: FliD/GG-29, FliC/GG-8, FlgL/GG-27, FlgK/GG-13, FlgE/GG-6, FlgG/GG-6, FlgH/GG-32, FlgF/GG-6, FlgI/GG-36, MotA/GG-17, MotB/GG-20, FlhA/GG-4, FlhB/GG-3, FliH, FliM/GG-21, FliN/GG-18, FliI, FliO, FliP/GG-5, FliR/GG-22, FliQ/GG-56, FliE, FlgB, FlgC/GG-10, FliF/GG-12, FliG/GG-7. Chemotaxis pathway: Aer/GG-1, Tar/GG-1, Trg/GG-1, Tap/GG-1, Tsr/GG-1, CheA/GG-11, CheW/GG-15, CheB/GG-30, CheR/GG-2, CheY, CheZ.]

[Figure 3, panel B (count matrices).]

Name  GG   p–g–  p+g–  p–g+  p+g+  MI    Function
FliP  5    50    22    3     71    0.36  Protein export
FliE  ––   51    70    2     23    0.05  Rod MS-ring junction
FliH  ––   52    76    1     17    0.04  Protein export
FliO  ––   53    85    0     8     0.04  Protein export
FliI  ––   0     0     53    93    0.00  Protein export ATPase
FlgC  10   49    22    4     71    0.31  Proximal rod
FlgB  ––   51    54    2     39    0.11  Proximal rod
CheA  11   46    19    7     74    0.30  Adds P to CheB/CheY
CheZ  ––   52    68    1     25    0.08  Removes P from CheY-P
CheY  ––   18    22    35    71    0.00  Promotes CW rotation

Figure 3 Motility GGs in the E. coli genome. (A) Depiction of the chemotaxis pathway and flagellar apparatus in E. coli. Genes detected by our approach are highlighted in red. (B) Count matrices of known motility genes in E. coli that were not detected by our approach; red rows correspond to genes that were detected by our analysis, and are shown here for comparison; the third column, 'p–g–', indicates the number of organisms in our data without the phenotype and without the gene; the next three columns are defined similarly; the seventh column indicates the gene–phenotype correlation (in bits); some genes (fliO, cheZ) are too specific, while others (fliI, cheY) are too abundant.


[Figure 4, panel A (row labels). GG-2: 3-Deoxy-manno-octulosonate cytidylyltransferase; GG-3: UDP-3-O glucosamine N-acyltransferase; GG-4: Lipid-A-disaccharide synthase; GG-5: Polysialic acid capsule expression protein; GG-7: UDP-3-O N-acetylglucosamine deacetylase; GG-8: 3-deoxy-manno-octulosonic-acid transferase; GG-11: Tetraacyldisaccharide 4'-kinase; GG-1: Outer membrane protein yaeT; GG-9: PAL peptidoglycan-associated lipoprotein; GG-10: tolQ/exbB protein; GG-12: tolB protein; GG-72: Lipid-A biosynthesis lauroyl acyltransferase; GG-20, GG-96, GG-53: HlyD family secretion protein; GG-111: Membrane fusion protein (MFP); GG-15, GG-52: Pyridoxal phosphate biosynthetic protein; GG-35: ABC transporter, permease; GG-68: Glutaredoxin 3; GG-29: 2-Octaprenyl-6-methoxyphenol hydroxylase; GG-31: Glutathione synthetase; GG-18: Glutaredoxin related protein; GG-73: Coproporphyrinogen III oxidase, aerobic; GG-107: Hydroxyacylglutathione hydrolase. Legend: Gram-negative, Gram-positive; scale bar 0.0 to 0.5 bits; average joint-assignment probabilities of the four modules: 0.95, 0.95, 0.96, 0.97. Column groups: γ-, α-, β-, and δ/ε-Proteobacteria, Firmicutes, Actinobacteria, Bacteria-miscellaneous.]

[Figure 4, panel B (diagram labels). LpxA/GG-3, LpxC/GG-7, LpxD/GG-3, LpxH, LpxB/GG-4, KdtA/GG-8, LpxL/GG-72, KdsA, YrbI/GG-66, KdsB/GG-2, LpxM/GG-72.]

[Figure 4, panel C (count matrices).]

Name  GG   p–g–  p+g–  p–g+  p+g+  MI    Function
LpxD  3    53    15    2     90    0.56  UDP-3-O-glucosamine N-acyltransferase
LpxH  137  54    51    1     54    0.22  UDP-2,3-diacylglucosamine hydrolase
KdsA  ––   21    16    34    89    0.04  2-dehydro-3-deoxyphosphooctulonate aldolase

Figure 4 Results for Gram-negative GGs. (A) Phylogenetic profiles of four Gram-negative GG modules. Each entry indicates whether a particular Gram-negative GG is represented in the genome of a specific organism (red for Gram-negatives and yellow for Gram-positives). (B) Depiction of the lipid-A biosynthesis pathway in E. coli. Genes detected by our approach are highlighted in red. (C) Count matrices of known lipid-A biosynthesis E. coli genes not detected by our approach (the red row for lpxD is presented for comparison). See Figure 2B for column descriptions.


[Figure 5 (legend and labels). Phenotypes: Aerobic, Facultative, Anaerobic; scale bar 0.0 to 0.5 bits; average joint-assignment probabilities of the seven modules: 0.98, 0.96, 1.00, 1.00, 0.95, 0.96, 0.98. Column groups: γ-, α-, β-, and δ/ε-Proteobacteria, Firmicutes, Bacteria-miscellaneous, Archaea, Actinobacteria.]

Figure 5 Examples of GG modules obtained for the three respiration phenotypes. The two modules in the upper part (with suffix 'ae') correspond to two aerobic GG modules; the two modules in the middle (with suffix 'fa') correspond to two facultative GG modules; the three modules at the bottom (with suffix 'an') correspond to three anaerobic GG modules. Each entry indicates whether a GG is represented in the genome of a specific organism (red for strict aerobes, orange for facultatives, and yellow for strict anaerobes).


genes associated with this phenotype with an encouragingly high precision.

The cross-genome list for this phenotype consists of 850 genes collected from all 17 endospore-forming bacteria. These 850 genes correspond to 145 groups of homologs, or endospore GGs. These GGs were clustered as before, yielding 13 robust modules (Supplementary Figures S11 and S12). The three most robust modules, namely those with the maximal average joint-assignment probability, are presented in Figure 6.

[Figure 6 (row labels, in figure order). GG-8: Sporulation-blocking protein yabP; GG-130: Sporulation sigma-E factor processing peptidase; GG-58: Stage III sporulation protein AC; GG-6: Stage III sporulation protein AD; GG-3: Stage III sporulation protein D; GG-63: Spore-cortex-lytic enzyme; GG-87: Spore germination protein; GG-104: Spore protease; GG-136: Spore protease related; GG-71: Stage III sporulation protein AB; GG-103: Stage III sporulation protein AE; GG-132: Stage III sporulation protein AG; GG-95: Stage II sporulation protein E; GG-137: Stage II sporulation protein M; GG-11: Stage II sporulation protein P; GG-134: Stage II sporulation protein R; GG-135: Stage IV sporulation protein; GG-76: Stage IV sporulation protein A; GG-46: Stage IV sporulation protein B; GG-40: Stage V sporulation protein AC; GG-34: Stage V sporulation protein AD; GG-15: Stage V sporulation protein AF; GG-37: Translocation-enhancing protein; GG-94, GG-127: Hypothetical membrane protein; GG-54: abi, CAAX amino-terminal protease; GG-42: Cytochrome C-550/C-551; GG-53: Cytochrome C oxidase subunit IV; GG-124: Hypothetical membrane protein; GG-50: Lipoprotein, putative; GG-36: Menaquinol-cytochrome C reductase qcrC; GG-18: Prespore-specific transcriptional regulator; GG-66: Putative lipoprotein; GG-56: Putative ribonuclease H; GG-26: Reductase ribT/acetyltransferase gnaT; GG-49: Small acid-soluble spore protein I sspI; GG-69: spoVID-dependent spore coat assembly factor; GG-101: Spore coat protein; GG-52: Spore coat protein E; GG-99: Spore coat related, putative; GG-97: Spore cortex biosynthesis, putative; GG-84, GG-90: Spore germination protein; GG-55: Spore germination protein C1; GG-62: Sporulation initiation phosphotransferase; GG-113: Stage III sporulation protein AF; GG-64: Stage IV sporulation protein FA; GG-91: Stage VI sporulation protein D; GG-118: Hypothetical membrane protein; GG-29, GG-38, GG-120: Hypothetical cytosolic protein; GG-24, GG-27, GG-28, GG-30, GG-31, GG-32, GG-33, GG-41, GG-43, GG-47, GG-60, GG-61, GG-65, GG-67, GG-68, GG-70, GG-72, GG-73, GG-83: Hypothetical protein; GG-88: Hypothetical protein, HD domain; GG-100: Hypothetical protein (ecsc); GG-114, GG-116, GG-117: Hypothetical protein. Legend: Endospores, No endospores; scale bar 0.0 to 0.5 bits; average joint-assignment probabilities of the three modules: 1.00, 1.00, 1.00. Column groups: Firmicutes, Actinobacteria, γ-, α-, and δ/ε-Proteobacteria, Bacteria-miscellaneous, Archaea.]

Figure 6 Phylogenetic profiles of the three most robust modules of endospore GGs. Each entry indicates whether an endospore GG is represented in the genome of a specific organism (red for sporulating organisms and yellow for nonsporulating ones).


The first two modules are dominant in the genus Bacillus and in the anaerobic Clostridia. Remarkably, 22 out of the 25 GGs in these two modules are known to play a role in sporulation, and two of the three remaining GGs are poorly characterized. Note that the species Clostridium tetani is known to be capable of forming spores, but the particular sequenced strain in our data is a nonsporulating variant, and was therefore annotated as such by NCBI. It is thus not surprising that this nominally nonsporulating species is represented in one of these two modules. In addition, both modules are found in the Symbiobacterium thermophilum genome, which according to NCBI is not capable of forming endospores. However, it was recently shown that this organism indeed forms endospore-like cellular structures (Ueda et al, 2004), in agreement with our results.

The third module is limited to the genus Bacillus. Nevertheless, 14 out of the 19 characterized GGs in this module are also known to be associated with sporulation; in addition, 29 GGs in this module correspond to groups of homologous genes in which all group members are poorly characterized to date. Finally, we note that the three spore-forming Actinobacteria are not represented in these three modules; in fact, these three species are represented in only a few modules in Supplementary Figure S11. Although the genetic basis of this phenotype has not been extensively studied in these species, it is known to differ from the canonical pathways in Bacillus subtilis (Sonenshein, 2002). Thus, recovering the genetic basis of sporulation in this class will likely require more than just the three complete genome sequences that were available in our data.

Pathogenicity: intracellular

Intracellular pathogenicity is a complex, largely unexplored phenotype of great medical interest. It involves multiple interactions between the bacteria and the host eukaryotic cells (e.g., cell invasion, host-bacteria small-molecule transfer, and immune evasion), and is therefore likely to correspond to a diverse set of molecular mechanisms. As shown below, the modules predicted by our approach capture the diversity of these interactions, and in several cases allow us to make functional predictions.

The cross-genome list for this phenotype consists of 1178 genes collected from 47 intracellular pathogenic bacteria. These 1178 genes correspond to 224 groups of homologs, or intracellular pathogenesis GGs. These GGs were clustered as before, yielding 19 robust modules (Supplementary Figures S13 and S14), six of which are presented in Figure 7.

The first two modules in Figure 7 correspond to different members of the type III secretion system. The type III secretion system has been thoroughly studied in Salmonella enterica (Galan, 2001); it directs the translocation of bacterial proteins, termed effector proteins, into the host cell. These effector proteins carry out several distinct roles, such as modulating the actin cytoskeleton to facilitate bacterial entry into non-phagocytic cells (Fu and Galan, 1999). Moreover, the type III secretion system remains active after internalization in order to deliver proteins into the cytosol of the host cell (Collazo and Galan, 1997). The first of these two modules, present in many intracellular pathogenic γ-proteobacteria and in two β-proteobacteria, but in none of the extracellular pathogens of the same phylogenetic groups, contains three GGs associated with components of the type III secretion system, and two GGs that correspond to effector proteins. The second module is more generic and includes components of the type III secretion system that are also present in Chlamydiae genomes. Interestingly, one of the two GGs in this module corresponds to a low calcium response chaperone. Such chaperones have been shown to bind effector proteins in the bacterial cytosol, and may be involved in stabilizing these molecules or preventing their interactions with other proteins (Mecsas and Strauss, 1996).

The presence of several phage-related GGs in the third and fourth modules of Figure 7 is intriguing, as it has been shown that a type III effector protein is encoded within the genome of a cryptic phage present in the Salmonella typhimurium genome (Hardt et al, 1998). It is possible that the phage-related GGs in Figure 7 are not directly involved in intracellular pathogenicity, but have been cotransferred along with certain type III effector proteins. Nevertheless, the significant association of several of these GGs with the phenotype provides corroborating evidence that phage-mediated transfer of genetic material may play an important role in host cell invasion by pathogenic intracellular bacteria (Hardt et al, 1998).

The fifth module in Figure 7 consists of two GGs that are present in all the obligate intracellular Rickettsia and Chlamydia in our data. One of these two GGs is uncharacterized, while the other is associated with ADP/ATP carrier proteins. In obligate intracellular species, these proteins are known to take up ATP in exchange for ADP within the cytosol of their eukaryotic hosts.

Finally, the sixth module in Figure 7 consists of GGs present in almost all pathogenic intracellular Actinobacteria, but in none of the extracellular pathogens. This module contains several membrane-associated GGs, but almost all of them are uncharacterized in all the respective genomes. Interestingly, one of the GGs in this module, GG-195, includes the lprG lipoprotein in Mycobacterium tuberculosis; lprG was recently shown to inhibit MHC-II antigen processing in human macrophages, thus possibly contributing to immune evasion by M. tuberculosis (Gehring et al, 2004). It will be interesting to test experimentally whether the other members of this module also have similar molecular functions, as predicted by our approach.

Discussion

Our capacity to sequence genomes has far outpaced our ability to understand their biology (Margulies et al, 2005; Shendure et al, 2005). Although functional genomic strategies show enormous promise when it comes to model organisms, adapting them to the vast majority of species of biomedical and industrial interest presents a formidable challenge. Here, we have shown that our computational approach provides an effective alternative in revealing the genetic basis of a variety of phenotypic traits and behaviors. Specifically, our approach recovers distinct gene modules that are associated with the


[Figure 7 (row labels, in figure order). GG-132: Pathogenicity island 1 effector to enter epithelial cells; GG-20: Type III secretion regulator, araC-family; GG-131: Type III secretion system, 1 island effector; GG-55: Type III secretion system, cell invasion; GG-199: Type III secretion, pathogenicity 1 island effector; GG-196: Type III secretion chaperone, low calcium response; GG-28: Type III secretion system, virulence associated; GG-15: Prophage cp-933k/v/w/y encoded; GG-48: Prophage cp-933p/w, excisionase; GG-169: Hypothetical protein; GG-116: Membrane, outer membrane, putative; GG-202: Polyferredoxin, putative; GG-180: Prophage cp-933n encoded; GG-113: Hypothetical protein, cest; GG-21: ADP, ATP carrier protein; GG-7: Hypothetical protein, yage family; GG-191: Hypothetical protein; GG-190: Exported protein, putative; GG-195: Lipoprotein lprG, putative; GG-189: Membrane secreted protein, putative; GG-194: Membrane, transmembrane, putative; GG-192: Secreted protein, putative; GG-111: Hypothetical protein. Legend: Pathogenic intracellular, Pathogenic extracellular; scale bar 0.0 to 0.5 bits; average joint-assignment probabilities of the six modules: 0.98, 1.00, 0.93, 0.95, 0.91, 1.00. Columns are individual genomes, grouped within each phenotype into γ-, α-, β-, and δ/ε-Proteobacteria, Firmicutes, Actinobacteria, and Bacteria-miscellaneous.]

Figure 7 Phylogenetic profiles of six robust modules of intracellular pathogenicity GGs. Each entry indicates whether an intracellular pathogenicity GG is represented in the genome of a specific organism (red for intracellular pathogens, yellow for extracellular pathogens).


considered phenotype in a generic sense, that is, across most of the species expressing the phenotype.

The modules we find map directly to enzymatic pathways (lipid-A biosynthesis), molecular complexes (NADH dehydrogenase), signaling pathways (chemotaxis system), and molecular machines (bacterial flagellum, type III secretion system). It is important to emphasize that these modules are the products of an unsupervised clustering process. The phenotype-associated GGs are clustered numerous times under different initial conditions, and the reported modules correspond to sets of GGs that are consistently placed by the algorithm in the same cluster, regardless of its initialization.
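The consistency criterion described above can be illustrated with a minimal sketch. It assumes that each clustering run is summarized as a vector of cluster labels, one per GG; the function names, the NumPy representation, and the definition of a module's average as the mean over all pairs of its members are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def joint_assignment_probability(runs):
    """Fraction of clustering runs in which each pair of items shares a cluster.

    runs: array-like of shape (n_runs, n_items); runs[r][k] is the cluster
    label assigned to item k in run r (labels need not match across runs).
    """
    labels = np.asarray(runs)
    same_cluster = labels[:, :, None] == labels[:, None, :]
    return same_cluster.mean(axis=0)

def average_joint_assignment(prob, members):
    """Mean pairwise joint-assignment probability within a candidate module
    (assumed definition: average over all ordered pairs of distinct members)."""
    members = list(members)
    if len(members) < 2:
        return 1.0
    sub = prob[np.ix_(members, members)]
    off_diagonal = sub.sum() - np.trace(sub)
    return off_diagonal / (len(members) * (len(members) - 1))

# Three hypothetical runs over four GGs.
runs = [[0, 0, 1, 1],
        [1, 1, 0, 0],
        [0, 0, 0, 1]]
prob = joint_assignment_probability(runs)
```

In this toy input, items 0 and 1 are always clustered together, so a module containing only those two would have an average joint-assignment probability of 1.0, whereas items 2 and 3 share a cluster in only two of the three runs.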

The phylogenetic signatures of these modules provide insights into dominant evolutionary trends in their utilization. For example, many chemotaxis and flagella genes have phylogenetic profiles that naturally correlate with motility, but our analysis separates them into two distinct and robust modules, implying that they actually correlate with motility in different ways (Figure 2). In fact, the generic chemotaxis module, as a module, may couple with other types of flagella (e.g., those of motile archaea) or motility mechanisms, while the generic flagella module may couple with other types of chemotaxis systems (e.g., in the Legionella genomes).

Similar modularity patterns were observed in essentially all the phenotypes that we examined, most of which are only partially understood at present. Thus, our results support the notion of modularity in molecular systems (Hartwell et al, 1999; Ravasz et al, 2002) and demonstrate how such modules have acquired distinguishable co-inheritance patterns throughout evolution. In cases where the molecular mechanisms underlying a phenotype are relatively uncharacterized (e.g., pathogenic intracellular bacteria), the modularity revealed by our approach provides concrete hypotheses that can be used to design more focused experiments. For example, the uncharacterized membrane proteins of the last module in Figure 7 should be tested first for their ability to inhibit MHC-II antigen processing in human macrophages, as the only characterized member of this module was shown to have this property. On the other hand, experiments designed to test the involvement of these genes in type III secretion may be less fruitful, as the members of this system are dominant in two other modules in Figure 7, both of which have very different phylogenetic signatures.

At the time this report was submitted (October 2005), 266 complete microbial genomes were already available through NCBI and the sequencing of 549 others was reported as being in progress. As more whole genome sequences become available, we expect that methods like the one presented here will allow the detection of increasingly precise modularity, as large cohorts of new bacterial genomes will be shown to have lost or gained entire groups of genes involved in similar functions. Meanwhile, methodologies for high-throughput phenotypic annotation are being actively investigated, and are being applied to phenotypes of which very little is understood at present. One such promising methodology is the comprehensive mapping of microbial communities, either through rDNA sequencing (Eckburg et al, 2005) or low-coverage shotgun sequencing (Venter et al, 2004; Tringe et al, 2005). The computational framework presented here should be instrumental in revealing modules of genes shared by members of these communities (as opposed to non-members), which are broadly used to sustain life and harness environmental resources in their natural habitats. Systematic phenotype annotations, such as those already available at NCBI, have been created only recently, and typically for phenotypes that have been studied for decades and that are relatively well understood. It is plausible to assume that this trend will soon change, as phenotype and genotype will be almost immediately associated, and the corresponding genes will be automatically grouped into functionally coherent modules, using methods such as ours.

While our approach can be applied to arbitrary phenotypes, one should bear in mind that association does not necessarily reflect causality. In particular, the lifestyles and native habitats of free-living microbes induce strong phenotype–phenotype correlations that could potentially confound the interpretations of studies such as ours. The observed association of the PTS sugar transporter modules with the facultative phenotype (Figure 5) may correspond to this type of correlation, where the ability to thrive in both aerobic and anaerobic settings may reflect the dominance of facultative bacteria in carbohydrate-rich environments. Indeed, most of the facultative microbes in our data are commensal or pathogenic bacteria that can inhabit nutrient-rich environments within the host. In principle, such phenotype correlations can be explicitly modeled once they are measured. However, measuring these correlations will require considering many more phenotypes, in a much larger sample of species, than those currently available. Interestingly, a recent study demonstrated how phenotypic data for a diverse set of species can be automatically and successfully gathered from the literature (Korbel et al, 2005).

A related issue of concern is the degree to which the uneven phylogenetic distribution of a particular trait may confound the interpretation of our results. In a severe scenario, the approach may identify species-distinguishing modules, rather than those that underlie a particular trait. For example, the Bacillus species constitute most of the endospore-forming species in our data. Thus, a priori, one might be concerned that the Bacillus phylogenetic signature may overwhelm the endospore phenotype; nonetheless, as our results show, we identify a large number of components known to be involved in sporulation with an encouraging precision. For example, 22 out of the 25 endospore GGs in the first two modules in Figure 6 correspond to known sporulation genes that are also present in non-Bacillus sporulating organisms. While these observations may not necessarily generalize to all phenotypes, rapid advances in sequencing efficiency (Margulies et al, 2005; Shendure et al, 2005) are expected to even out phylogenetic coverage well beyond any conceivable concern here.

We have presented a computational framework for revealing the underlying genetic architecture of a trait by characterizing its expression at the organism level, across many species. Our complete set of results, available at our Web site, provides a wealth of experimentally testable hypotheses that associate genes with complex traits via the simultaneous analysis of more than 200 complete genome sequences. Beyond its utility for generating hypotheses, our approach reveals an intrinsic modularity in genetic networks, and highlights the extensive


and broad sharing of optimized genetic modules across the tree of life. The utility of this approach depends crucially on the success of efforts to systematically characterize the vast variety of diverse phenotypic traits throughout the microbial biosphere.

Materials and methods

Genome sequences and phenotype annotations

We downloaded the 214 complete microbial genome sequences that were available at the NCBI Web site (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi) on January 18, 2005. To avoid redundancy, for each pair of highly similar genomes, we only retained the genome that was sequenced first. This led to a total of 202 genomes. Phenotype annotations were also downloaded from the NCBI site. The intracellular pathogenicity phenotype was generated by manual curation, based on literature search, data available at NCBI, and data reported at the IslandPath Web site at http://pathogenomics.sfu.ca/islandpath/current/IPindex.pl (Hsiao et al, 2003). All these phenotype annotations are available at our Web site, http://tavazoielab.princeton.edu/genphen/.

Creating the gene phylogenetic profiles

For every gene in every genome, we applied a BLAST search (Altschul et al, 1990) against all the remaining 201 genomes to identify possible homologs. Similarly to previous work (Jim et al, 2004), a genome was considered as containing a homolog when one of its proteins had an alignment to the query protein sequence with an e-value smaller than 10⁻¹⁰. Proteins with fewer than 50 amino acids were ignored. This resulted in 591 640 phylogenetic profiles, where each profile is a binary vector with 202 elements. Preliminary tests indicated that using the raw BLAST e-values in the phylogenetic profiles (Date and Marcotte, 2003) yields similar results in the analysis that follows. Therefore, all the subsequent analysis used this binary representation. The entire phylogenetic profile collection is available at our Web site.
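As a minimal sketch of how such a binary profile could be assembled, the snippet below assumes that the best BLAST e-value of the query protein against each genome has already been computed and stored in a list (with None where no alignment was found). The function name, the input layout, and the choice to mark the gene's own genome as 1 are illustrative assumptions rather than details taken from the text.

```python
import numpy as np

E_VALUE_CUTOFF = 1e-10    # e-value threshold stated in the text
MIN_PROTEIN_LENGTH = 50   # proteins with fewer amino acids are ignored

def phylogenetic_profile(best_evalues, own_genome, protein_length):
    """Return a binary presence/absence vector over genomes, or None.

    best_evalues[k] is the best e-value of the query protein against genome k
    (None if there is no hit). own_genome is the index of the genome the
    query comes from; it is set to 1 here, which is an assumption.
    """
    if protein_length < MIN_PROTEIN_LENGTH:
        return None
    profile = np.zeros(len(best_evalues), dtype=np.uint8)
    for k, e_value in enumerate(best_evalues):
        if k == own_genome:
            profile[k] = 1
        elif e_value is not None and e_value < E_VALUE_CUTOFF:
            profile[k] = 1
    return profile

# Hypothetical query from genome 1, compared against four genomes in total.
example = phylogenetic_profile([1e-50, None, 1e-5, 1e-12], own_genome=1, protein_length=120)
```

With the full data set, the same function would be called with a list of 202 e-values per gene.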

Estimating gene–phenotype mutual information

Given a gene phylogenetic profile and a phenotype profile, we can define a count matrix, $N$, where $N_{1,1}$ is the number of species with the phenotype and the gene, $N_{1,2}$ is the number of species with the phenotype but without the gene, $N_{2,1}$ is the number of species without the phenotype but with the gene, and $N_{2,2}$ is the number of species without the phenotype and without the gene. The empirical mutual information between the profiles is given by

$$I(\mathrm{gene}; \mathrm{phen}) = \sum_{i,j} P_{i,j} \log \frac{P_{i,j}}{P_i P_j}$$

where $P_{i,j} = N_{i,j} / \sum_{i,j} N_{i,j}$, $P_i = P_{i,1} + P_{i,2}$, and $P_j = P_{1,j} + P_{2,j}$ (Cover and Thomas, 1991), and the logarithm is taken to base 2 so that the information is measured in bits. This information is naturally normalized between 0 and 1 bits, where 0 bits means no dependency, while high information values imply strong correlation between the gene and the phenotype profile. All the gene–phenotype information relations were estimated through the direct method (Strong et al, 1998; Slonim et al, 2005a) in order to correct for finite sample effects, using the software available at http://www.genomics.princeton.edu/biophysics-theory/DirectMI/web-content/index.html with its default parameters. The same procedure was applied to randomly shuffled gene phylogenetic profiles, and the maximum information value obtained was used as a threshold for significance. That is, the association of a gene with a phenotype was considered significant if and only if their mutual information was found to be greater than the maximal value obtained in the shuffled data. Importantly, in contrast to previously used correlation measures (Huynen et al, 1998; Levesque et al, 2003; Makarova et al, 2003; Jim et al, 2004; Korbel et al, 2005), the mutual information can be equally applied to continuous phenotypes such as optimal growth temperature, and to measure the correlations between sets of genes and phenotypes, as we plan to investigate in a subsequent study.
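The following sketch illustrates the empirical mutual information and the shuffle-based significance threshold described above. It uses the naive plug-in estimator only; it does not implement the finite-sample correction of the direct method, so its values can differ slightly from those reported. The function names and the layout of the count matrix as nested lists are assumptions made for illustration.

```python
import math
import random

def mutual_information_bits(counts):
    """Plug-in mutual information (in bits) of a 2x2 count matrix.

    counts[i][j]: i = 0 phenotype present, i = 1 phenotype absent;
                  j = 0 gene present,      j = 1 gene absent.
    """
    total = sum(sum(row) for row in counts)
    row_p = [sum(row) / total for row in counts]
    col_p = [sum(counts[i][j] for i in range(2)) / total for j in range(2)]
    mi = 0.0
    for i in range(2):
        for j in range(2):
            p = counts[i][j] / total
            if p > 0:  # empty cells contribute nothing
                mi += p * math.log2(p / (row_p[i] * col_p[j]))
    return mi

def count_matrix(gene_profile, phen_profile):
    """Build the 2x2 count matrix from two binary profiles of equal length."""
    n = [[0, 0], [0, 0]]
    for g, p in zip(gene_profile, phen_profile):
        n[0 if p else 1][0 if g else 1] += 1
    return n

def shuffle_threshold(gene_profiles, phen_profile, seed=0):
    """Maximum information obtained over randomly shuffled gene profiles."""
    rng = random.Random(seed)
    best = 0.0
    for profile in gene_profiles:
        shuffled = list(profile)
        rng.shuffle(shuffled)
        best = max(best, mutual_information_bits(count_matrix(shuffled, phen_profile)))
    return best

# Counts taken from the FliP and FliI rows of Figure 3B.
flip_counts = [[71, 22], [3, 50]]
flii_counts = [[93, 0], [53, 0]]
flip_mi = mutual_information_bits(flip_counts)
flii_mi = mutual_information_bits(flii_counts)
```

For the FliP counts the plug-in estimate comes out at roughly 0.38 bits, close to the 0.36 bits listed in Figure 3B; a small gap of this kind is what one would expect from an estimator that is not corrected for finite sample size. FliI, which is present in every genome, carries no information about the phenotype.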

Finding the phenotype GGs

The construction of the phenotype GGs consists of two phases. In the first phase, for every phenotype, we collected the 50 genes with the strongest positive correlation with the phenotype from every organism having the phenotype (as long as this correlation was significant), and joined them into a single cross-genome list. We then defined a similarity graph among all the genes in this list, in which two genes were connected by an edge if their corresponding BLAST e-value was smaller than 10⁻¹⁰; next, an agglomerative merging process was applied to find strongly connected components in this graph. Specifically, the algorithm first assigns every gene to a singleton group, and then repeatedly performs the merger with the maximal 'score', where the score of merging two groups of genes is defined as the probability of having an edge between two genes chosen independently, one from each group. More formally, denoting the two groups by $c_1$ and $c_2$, the corresponding merger score is

$$\frac{1}{|c_1|\,|c_2|} \sum_{g_1 \in c_1} \sum_{g_2 \in c_2} B(g_1, g_2)$$

where $B(g_1, g_2)$ is 1 for homologous genes and 0 otherwise, and $|c_i|$ denotes the number of genes in each group. If more than one merger attained the maximal score, the one resulting in the largest new group was preferred. We chose a score threshold of 0.5 as a stopping criterion for the merging process. Our analysis was highly robust with respect to this parameter: score thresholds of up to 0.7 gave identical results. The merging process results in groups of homologous genes, in which most gene pairs have a BLAST e-value below $10^{-10}$. The average edge density in the resulting groups was around 99%, that is, most groups corresponded to almost fully connected components in the aforementioned BLAST similarity graph. To further validate the robustness of these results, we used the BLASTClust software (available as part of the BLAST package) over the same cross-genome lists. For all phenotypes, this resulted in groups of homologous genes that were highly similar to those extracted by our merging algorithm.
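The merging process can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this study: genes are represented by arbitrary hashable labels, the similarity graph is given as a list of undirected edges, and merging is assumed to stop as soon as the best available score falls below the threshold. The names `merge_score` and `agglomerate` are hypothetical.

```python
from itertools import combinations


def merge_score(c1, c2, homologous):
    """Fraction of cross-group gene pairs that are joined by an edge."""
    hits = sum(1 for g1 in c1 for g2 in c2 if frozenset((g1, g2)) in homologous)
    return hits / (len(c1) * len(c2))


def agglomerate(genes, edges, threshold=0.5):
    """Greedy agglomerative merging of genes into groups of homologs."""
    homologous = {frozenset(e) for e in edges}
    # Start with every gene in its own singleton group.
    groups = [frozenset([g]) for g in genes]
    while len(groups) > 1:
        # Pick the merger with the maximal score; ties go to the larger merged group.
        best = max(
            combinations(groups, 2),
            key=lambda pair: (
                merge_score(pair[0], pair[1], homologous),
                len(pair[0]) + len(pair[1]),
            ),
        )
        # Assumed stopping rule: stop once the best score drops below the threshold.
        if merge_score(best[0], best[1], homologous) < threshold:
            break
        groups = [g for g in groups if g not in best] + [best[0] | best[1]]
    return groups
```

For example, with genes `a` to `e` and edges `a-b`, `b-c`, `a-c`, and `d-e`, the sketch returns the two groups `{a, b, c}` and `{d, e}`. Evaluating all pairs at every step is quadratic in the number of groups, which is acceptable for a sketch but would need caching for large lists.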

In the second phase of our construction, each group of homologous genes was further expanded to include additional homologs that were not detected in the first phase. Specifically, a gene from a species having the phenotype was added to a group if it had a BLAST e-value below $10^{-10}$ with at least one-third of the original group members (again, different values of this parameter gave very similar results). As a simple example, let us consider the case of the flgL gene in E. coli. This gene, involved in flagellar biosynthesis, obtained an information score of ~0.16 bits over the motility phenotype, which was not sufficient for it to be included among the 50 E. coli genes that were most informative about motility. As a result, this gene was not included in the cross-genome list from which we constructed the groups of homologous genes in the first phase. Nevertheless, several homologs of this gene in other species (e.g., flgL in Bacillus subtilis) obtained higher information scores that placed them among the 50 most informative genes about motility in their respective genomes. This gave rise to a group of flgL homologous genes that was constructed in the first phase, to which the flgL gene in E. coli was added as an expansion in this second phase. A summary file, describing all the genes in every group, along with relevant details from the NCBI annotation, is available at our Web site. The NCBI gene textual descriptions were used to determine a concise textual title for every group. In principle, the genes in each group correspond to different reflections of the same ancestral entity, with phylogenetic profiles that strongly correlate with the examined phenotype profile. Therefore, these groups are termed here phenotype generic genes (GGs).
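The expansion rule of the second phase can be illustrated with a short sketch. The function name `expand_group` and the `is_homolog` callback (standing in for a BLAST e-value test below $10^{-10}$) are assumptions introduced for illustration. The one-third criterion is written with integer arithmetic to avoid floating-point rounding at the boundary.

```python
def expand_group(group, candidates, is_homolog):
    """Return the group plus every candidate homologous to >= 1/3 of its original members."""
    original = list(group)
    added = set()
    for gene in candidates:
        if gene in group:
            continue
        hits = sum(1 for member in original if is_homolog(gene, member))
        # "At least one-third of the original members": 3 * hits >= group size.
        if 3 * hits >= len(original):
            added.add(gene)
    return set(original) | added
```

Only the original members are consulted when testing a candidate, so the outcome does not depend on the order in which candidates are examined.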

Finding robust GG modules

A phenotype GG corresponds to a group of homologous genes, all taken only from species that have the phenotype. However, constructing the GG phylogenetic profile (e.g., for the purpose of identifying GG modules) requires that all species be considered. To that end, we applied the following procedure. If more than one-third of the original GG members had a BLAST e-value smaller than $10^{-10}$ in a particular genome, the profile entry of the GG for this genome was set to 1. Otherwise, it was set to 0.
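The profile construction amounts to a simple thresholded vote over the GG members. In the sketch below, `gg_profile` and the `has_hit` callback (assumed to report whether a member has a BLAST hit with e-value below $10^{-10}$ in a given genome) are hypothetical names used only for illustration.

```python
def gg_profile(gg_members, genomes, has_hit):
    """Binary presence/absence profile of a GG across all genomes."""
    profile = []
    for genome in genomes:
        hits = sum(1 for member in gg_members if has_hit(member, genome))
        # Entry is 1 only if strictly more than one-third of the members have a hit.
        profile.append(1 if 3 * hits > len(gg_members) else 0)
    return profile
```

Note that this criterion is strict (more than one-third), whereas the expansion step in the first-phase groups uses at least one-third.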

Given the GG phylogenetic profiles, we estimated the mutual information between every pair of GGs, and used these information relations as input to the Iclust clustering algorithm (Slonim et al, 2005b) (manuscript and software available at http://www.genomics.princeton.edu/biophysics-theory/Clustering/web-content/index.html). This algorithm finds a partition of the GGs into clusters such that GGs in the same cluster are highly informative about each other, that is, have highly similar phylogenetic profiles.

The Iclust algorithm corresponds to a fully principled clustering methodology. However, as with any other clustering algorithm, it may produce suboptimal solutions in a single run, depending on the (random) partition used in its initialization. To address this issue, we applied the Iclust algorithm with default parameter values 1000 times, each with a different initial random partition, yielding potentially 1000 (slightly) different clustering solutions. Next, for every pair of GGs, we defined the joint-assignment probability as the number of solutions in which the pair was placed by the algorithm in the same cluster, divided by 1000. Thus, two GGs that are very often placed in the same cluster by the algorithm will have a relatively high joint-assignment probability. Finally, we defined a graph over all the GGs, where two GGs were connected by an edge if their corresponding joint-assignment probability was greater than 0.9, and used the merging process described earlier to find fully connected components in this graph. The resulting connected components with at least two GGs correspond to the robust GG modules that we report and analyze in this study. By definition, each such module corresponds to GGs consistently placed by the clustering algorithm in the same cluster, almost regardless of the initial random partition used. Our analysis was relatively insensitive to variations in the threshold used at this stage. For example, using a joint-assignment probability threshold of 0.8 gave similar results for all phenotypes.
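The joint-assignment step can be sketched independently of the clustering algorithm itself. The sketch below assumes each clustering solution is supplied as a dictionary mapping a GG label to a cluster label. It does not run Iclust, and the names `joint_assignment` and `consensus_edges` are hypothetical.

```python
from itertools import combinations


def joint_assignment(solutions):
    """Map every GG pair to the fraction of solutions placing both in one cluster."""
    items = sorted(solutions[0])
    prob = {}
    for a, b in combinations(items, 2):
        together = sum(1 for labels in solutions if labels[a] == labels[b])
        prob[(a, b)] = together / len(solutions)
    return prob


def consensus_edges(prob, threshold=0.9):
    """Edges of the consensus graph: pairs co-clustered more often than the threshold."""
    return [pair for pair, p in prob.items() if p > threshold]
```

The resulting edge list could then be passed to the same agglomerative merging process used for the homology graph in order to extract the robust modules.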

Supplementary information

Supplementary information is available at the Molecular Systems Biology website (www.nature.com/msb).

Acknowledgements

We are grateful to CS Chan, H Girgis, W Bialek, and the two anonymous referees for many insightful comments on preliminary versions of the manuscript. This work was supported in part by NIH, NSF, and DARPA.

References

Alberts B, Bray D, Lewis J, Raff M, Roberts K, Watson J (1994) Molecular Biology of the Cell. New York: Garland Science Publishing

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215: 403–410

Berardi MJ, Bushweller JH (1999) Binding specificity and mechanistic insight into glutaredoxin-catalyzed protein disulfide reduction. J Mol Biol 292: 151–161

Collazo CM, Galan JE (1997) The invasion-associated type III system of Salmonella typhimurium directs the translocation of Sip proteins into the host cell. Mol Microbiol 24: 747–756

Copley SD, Dhillon JK (2002) Lateral gene transfer and parallel evolution in the history of glutathione biosynthesis genes. Genome Biol 3: research0025.1–0025.16

Cover TM, Thomas JA (1991) Elements of Information Theory. New York: John Wiley & Sons

Date SV, Marcotte EM (2003) Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages. Nat Biotechnol 21: 1055–1062

Dreyfus G, Williams AW, Kawagishi I, Macnab RM (1993) Genetic and biochemical analysis of Salmonella typhimurium FliI, a flagellar protein related to the catalytic subunit of the F0F1 ATPase and to virulence proteins of mammalian and plant pathogens. J Bacteriol 175: 3131–3138

Eckburg PB, Bik EM, Bernstein CN, Purdom E, Dethlefsen L, Sargent M, Gill SR, Nelson KE, Relman DA (2005) Diversity of the human intestinal microbial flora. Science 308: 1635–1638

Errington J (2003) Regulation of endospore formation in Bacillus subtilis. Nat Rev Microbiol 1: 117–126

Fu Y, Galan JE (1999) A Salmonella protein antagonizes Rac-1 and Cdc42 to mediate host-cell recovery after bacterial invasion. Nature 401: 293–297

Galan JE (2001) Salmonella interactions with host cells: type III secretion at work. Annu Rev Cell Dev Biol 17: 53–86

Gehring AJ, Dobos KM, Belisle JT, Harding CV, Boom WH (2004) Mycobacterium tuberculosis LprG (Rv1411c): a novel TLR-2 ligand that inhibits human macrophage class II MHC antigen processing. J Immunol 173: 2660–2668

Genevrois S, Steeghs L, Roholl P, Letesson JJ, van der Ley P (2003) The Omp85 protein of Neisseria meningitidis is required for lipid export to the outer membrane. EMBO J 22: 1780–1789

Hardt WD, Urlaub H, Galan JE (1998) A substrate of the centisome 63 type III protein secretion system of Salmonella typhimurium is encoded by a cryptic bacteriophage. Proc Natl Acad Sci USA 95: 2574–2579

Hartwell LH, Hopfield JJ, Leibler S, Murray AW (1999) From molecular to modular cell biology. Nature 402 (Suppl): C47–C52

Hegermann J, Herrmann R, Mayer F (2002) Cytoskeletal elements in the bacterium Mycoplasma pneumoniae. Naturwissenschaften 89: 453–458

Hsiao W, Wan I, Jones SJ, Brinkman FS (2003) IslandPath: aiding detection of genomic islands in prokaryotes. Bioinformatics 19: 418–420

Huynen M, Dandekar T, Bork P (1998) Differential genome analysis applied to the species-specific features of Helicobacter pylori. FEBS Lett 426: 1–5

Jagannathan A, Constantinidou C, Penn CW (2001) Roles of rpoN, fliA, and flgR in expression of flagella in Campylobacter jejuni. J Bacteriol 183: 2937–2942

Jim K, Parmar K, Singh M, Tavazoie S (2004) A cross-genomic approach for systematic mapping of phenotypic traits to genes. Genome Res 14: 109–115

Kanehisa M, Goto S (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28: 27–30

Korbel JO, Doerks T, Jensen LJ, Perez-Iratxeta C, Kaczanowski S, Hooper SD, Andrade MA, Bork P (2005) Systematic association of genes to phenotypes by genome and literature mining. PLoS Biol 3: e134

Lerat E, Daubin V, Ochman H, Moran NA (2005) Evolutionary origins of genomic repertoires in bacteria. PLoS Biol 3: e130

Levesque M, Shasha D, Kim W, Surette MG, Benfey PN (2003) Trait-to-gene: a computational method for predicting the function of uncharacterized genes. Curr Biol 13: 129–133

Makarova KS, Wolf YI, Koonin EV (2003) Potential genomic determinants of hyperthermophily. Trends Genet 19: 172–176

Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer ML, Jarvie TP, Jirage KB, Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, Lohman KL, Lu H, Makhijani VB, McDade KE, McKenna MP, Myers EW, Nickerson E, Nobile JR, Plant R, Puc BP, Ronan MT, Roth GT, Sarkis GJ, Simons JF, Simpson JW, Srinivasan M, Tartaro KR, Tomasz A, Vogt KA, Volkmer GA, Wang SH, Wang Y, Weiner MP, Yu P, Begley RF, Rothberg JM (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature 437: 376–380

Mecsas JJ, Strauss EJ (1996) Molecular mechanisms of bacterial virulence: type III secretion and pathogenicity islands. Emerg Infect Dis 2: 270–288

Nierman WC, DeShazer D, Kim HS, Tettelin H, Nelson KE, Feldblyum T, Ulrich RL, Ronning CM, Brinkac LM, Daugherty SC, Davidsen TD, Deboy RT, Dimitrov G, Dodson RJ, Durkin AS, Gwinn ML, Haft DH, Khouri H, Kolonay JF, Madupu R, Mohammoud Y, Nelson WC, Radune D, Romero CM, Sarria S, Selengut J, Shamblin C, Sullivan SA, White O, Yu Y, Zafar N, Zhou L, Fraser CM (2004) Structural flexibility in the Burkholderia mallei genome. Proc Natl Acad Sci USA 101: 14246–14251

Nikaido H (2003) Molecular basis of bacterial outer membrane permeability revisited. Microbiol Mol Biol Rev 67: 593–656

Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA 96: 4285–4288

Pittman MS, Goodwin M, Kelly DJ (2001) Chemotaxis in the human gastric pathogen Helicobacter pylori: different roles for CheW, the three CheV paralogues, and evidence for CheV2 phosphorylation. Microbiology 147 (Part 9): 2493–2504

Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL (2002) Hierarchical organization of modularity in metabolic networks. Science 297: 1551–1555

Ruiz N, Falcone B, Kahne D, Silhavy TJ (2005) Chemical conditionality: a genetic strategy to probe organelle assembly. Cell 121: 307–317

Shendure J, Porreca GJ, Reppas NB, Lin X, McCutcheon JP, Rosenbaum AM, Wang MD, Zhang K, Mitra RD, Church GM (2005) Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309: 1728–1732

Shigenobu S, Watanabe H, Hattori M, Sakaki Y, Ishikawa H (2000) Genome sequence of the endocellular bacterial symbiont of aphids Buchnera sp. APS. Nature 407: 81–86

Slonim N, Atwal GS, Tkacik G, Bialek W (2005a) Estimating mutual information and multi-information in large networks. http://arxiv.org/abs/cs.IT/0502017

Slonim N, Atwal GS, Tkacik G, Bialek W (2005b) Information based clustering. Proc Natl Acad Sci USA 102: 18297–18302

Sonenshein AL (2002) Developmental biology: regulation by selective gene localization. Curr Biol 12: R90–R92

Strong SP, Koberle R, de Ruyter van Steveninck RR, Bialek W (1998) Entropy and information in neural spike trains. Phys Rev Lett 80: 197–200

Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, Podar M, Short JM, Mathur EJ, Detter JC, Bork P, Hugenholtz P, Rubin EM (2005) Comparative metagenomics of microbial communities. Science 308: 554–557

Ueda K, Yamashita A, Ishikawa J, Shimada M, Watsuji TO, Morimura K, Ikeda H, Hattori M, Beppu T (2004) Genome sequence of Symbiobacterium thermophilum, an uncultivable bacterium that depends on microbial commensalism. Nucleic Acids Res 32: 4937–4944

Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu D, Paulsen I, Nelson KE, Nelson W, Fouts DE, Levy S, Knap AH, Lomas MW, Nealson K, White O, Peterson J, Hoffman J, Parsons R, Baden-Tillson H, Pfannkoch C, Rogers YH, Smith HO (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science 304: 66–74

Vreeland RH, Rosenzweig WD, Powers DW (2000) Isolation of a 250 million-year-old halotolerant bacterium from a primary salt crystal. Nature 407: 897–900

Wolfe AJ, Millikan DS, Campbell JM, Visick KL (2004) Vibrio fischeri sigma54 controls motility, biofilm formation, luminescence, and colonization. Appl Environ Microbiol 70: 2520–2524

Wu T, Malinverni J, Ruiz N, Kim S, Silhavy TJ, Kahne D (2005) Identification of a multicomponent complex required for outer membrane biogenesis in Escherichia coli. Cell 121: 235–245
