global properties of the metabolic map of escherichia coli · global properties of the metabolic...

9
Global Properties of the Metabolic Map of Escherichia coli Christos A. Ouzounis 1,3 and Peter D. Karp 2 1 Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge, CB10 1SD, UK; 2 Bioinformatics Group, AI Center, SRI International, EK223, Menlo Park, California 94025 USA The EcoCyc database characterizes the known network of Escherichia coli small-molecule metabolism. Here we present a computational analysis of the global properties of that network, which consists of 744 reactions that are catalyzed by 607 enzymes. The reactions are organized into 131 pathways. Of the metabolic enzymes, 100 are multifunctional, and 68 of the reactions are catalyzed by >1 enzyme. The network contains 791 chemical substrates. Other properties considered by the analysis include the distribution of enzyme subunit organization, and the distribution of modulators of enzyme activity and of enzyme cofactors. The dimensions chosen for this analysis can be employed for comparative functional analysis of complete genomes. The genetic complement of Escherichia coli can be char- acterized by a sequence of 4.7 million DNA nucleotides (Blattner et al. 1997). How can we characterize the functional complement of E. coli, and according to what criteria can we compare the biochemical net- works of two organisms? Until recently, these ques- tions have been virtually impossible to answer. How- ever, the advent of the EcoCyc database (Karp et al. 1999a) allows us to address that question for a subset of the E. coli functional complement: the metabolic map, defined as the set of all known pathways, reactions, and enzymes of E. coli small-molecule metabolism (the terms metabolic map and metabolic network are used interchangeably). Herein, we present a computational analysis of global properties of the metabolic network of E. coli, as described by the EcoCyc database (Karp et al. 1999a). EcoCyc was designed for information retrieval by bi- ologists via the WWW. In addition, its highly struc- tured database schema was designed to support com- putation on the EcoCyc data, and our analysis was computed by programs that operate on the EcoCyc data. EcoCyc is the most comprehensive and detailed description of the metabolic map of any organism and is based on information obtained from the scientific literature. This paper characterizes qualitative aspects of the E. coli metabolic network rather than quantitative as- pects. We are interested in the connectivity relation- ships of the network; its partitioning into pathways; enzyme activation and inhibition; and the repetition and multiplicity of elements such as enzymes, reac- tions, and substrates. We do not address quantitative issues such as the fluxes through the network. RESULTS Results Based on the Principal Entities Represented in EcoCyc We begin by describing some of the properties of the principal entities stored in EcoCyc, such as proteins, reactions, metabolites, pathways, and their inter- relationships. We then present results obtained from specific queries about the properties of the metabolic map of E. coli. Proteins, Polypeptides, and Protein Complexes Proteins unify genome structure on the one hand (as products of genes) and biological function on the other (through reactions or pathways). The E. coli genome contains 4391 predicted genes, of which 4288 code for proteins (Blattner et al. 1997). Of these gene products, 676 form 607 enzymes of E. coli small-molecule me- tabolism. Of those enzymes, 311 are protein com- plexes; the remaining 296 are monomers. EcoCyc in- formation about protein complexes contains the com- plete description of their stoichiometry. For example, two nrdA and two nrdB subunits (both genes at min 50.5) form the B1 and B2 complexes of ribonucleoside– diphosphate reductase (EC 1.17.4.1). Protein com- plexes are homo- or heteromultimers. Figure 1 shows the distribution of the subunit composition of the pro- tein complexes. Reactions EcoCyc describes 905 metabolic reactions that are cata- lyzed by E. coli. Of these reactions, 161 are not involved in small-molecule metabolism, for example, they par- ticipate in macromolecule metabolism such as DNA replication and tRNA charging. Of the remaining 744 reactions of small-molecule metabolism, 569 have been assigned to at least one pathway, as shown in Figure 2 (each edge in the graph in Fig. 2 represents a 3 Corresponding author. E-MAIL [email protected]; FAX 44-1223-494471. Resource 568 Genome Research 10:568–576 ©2000 by Cold Spring Harbor Laboratory Press ISSN 1088-9051/00 $5.00; www.genome.org www.genome.org

Upload: others

Post on 25-Sep-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Global Properties of the Metabolic Map of Escherichia coli · global properties of the metabolic network of E. coli,as described by the EcoCyc database (Karp et al. 1999a). EcoCyc

Global Properties of the Metabolic Mapof Escherichia coliChristos A. Ouzounis1,3 and Peter D. Karp2

1Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge, CB101SD, UK; 2Bioinformatics Group, AI Center, SRI International, EK223, Menlo Park, California 94025 USA

The EcoCyc database characterizes the known network of Escherichia coli small-molecule metabolism. Here wepresent a computational analysis of the global properties of that network, which consists of 744 reactions thatare catalyzed by 607 enzymes. The reactions are organized into 131 pathways. Of the metabolic enzymes, 100are multifunctional, and 68 of the reactions are catalyzed by >1 enzyme. The network contains 791 chemicalsubstrates. Other properties considered by the analysis include the distribution of enzyme subunit organization,and the distribution of modulators of enzyme activity and of enzyme cofactors. The dimensions chosen for thisanalysis can be employed for comparative functional analysis of complete genomes.

The genetic complement of Escherichia coli can be char-acterized by a sequence of 4.7 million DNA nucleotides(Blattner et al. 1997). How can we characterize thefunctional complement of E. coli, and according towhat criteria can we compare the biochemical net-works of two organisms? Until recently, these ques-tions have been virtually impossible to answer. How-ever, the advent of the EcoCyc database (Karp et al.1999a) allows us to address that question for a subset ofthe E. coli functional complement: the metabolic map,defined as the set of all known pathways, reactions,and enzymes of E. coli small-molecule metabolism (theterms metabolic map and metabolic network are usedinterchangeably).

Herein, we present a computational analysis ofglobal properties of the metabolic network of E. coli, asdescribed by the EcoCyc database (Karp et al. 1999a).EcoCyc was designed for information retrieval by bi-ologists via the WWW. In addition, its highly struc-tured database schema was designed to support com-putation on the EcoCyc data, and our analysis wascomputed by programs that operate on the EcoCycdata. EcoCyc is the most comprehensive and detaileddescription of the metabolic map of any organism andis based on information obtained from the scientificliterature.

This paper characterizes qualitative aspects of theE. coli metabolic network rather than quantitative as-pects. We are interested in the connectivity relation-ships of the network; its partitioning into pathways;enzyme activation and inhibition; and the repetitionand multiplicity of elements such as enzymes, reac-tions, and substrates. We do not address quantitativeissues such as the fluxes through the network.

RESULTS

Results Based on the Principal Entities Representedin EcoCycWe begin by describing some of the properties of theprincipal entities stored in EcoCyc, such as proteins,reactions, metabolites, pathways, and their inter-relationships. We then present results obtained fromspecific queries about the properties of the metabolicmap of E. coli.

Proteins, Polypeptides, and Protein ComplexesProteins unify genome structure on the one hand (asproducts of genes) and biological function on the other(through reactions or pathways). The E. coli genomecontains 4391 predicted genes, of which 4288 code forproteins (Blattner et al. 1997). Of these gene products,676 form 607 enzymes of E. coli small-molecule me-tabolism. Of those enzymes, 311 are protein com-plexes; the remaining 296 are monomers. EcoCyc in-formation about protein complexes contains the com-plete description of their stoichiometry. For example,two nrdA and two nrdB subunits (both genes at min50.5) form the B1 and B2 complexes of ribonucleoside–diphosphate reductase (EC 1.17.4.1). Protein com-plexes are homo- or heteromultimers. Figure 1 showsthe distribution of the subunit composition of the pro-tein complexes.

ReactionsEcoCyc describes 905 metabolic reactions that are cata-lyzed by E. coli. Of these reactions, 161 are not involvedin small-molecule metabolism, for example, they par-ticipate in macromolecule metabolism such as DNAreplication and tRNA charging. Of the remaining 744reactions of small-molecule metabolism, 569 havebeen assigned to at least one pathway, as shown inFigure 2 (each edge in the graph in Fig. 2 represents a

3Corresponding author.E-MAIL [email protected]; FAX 44-1223-494471.

Resource

568 Genome Research 10:568–576 ©2000 by Cold Spring Harbor Laboratory Press ISSN 1088-9051/00 $5.00; www.genome.orgwww.genome.org

Page 2: Global Properties of the Metabolic Map of Escherichia coli · global properties of the metabolic network of E. coli,as described by the EcoCyc database (Karp et al. 1999a). EcoCyc

single reaction; >569 edges are contained in Fig. 2, assome reactions are present in multiple pathways). Thenumber of reactions (744) and the number of enzymes(607) differ because (1) there is no one-to-one mappingbetween enzymes and reactions—some enzymes cata-lyze multiple reactions, and some reactions are cata-lyzed by multiple enzymes; and (2) for some reactionsknown to be catalyzed by E. coli, the enzyme has notyet been identified.

The enzyme nomenclature system devised by theInternational Union of Biochemistry and Molecular Bi-ology describes many enzyme-catalyzed reactions from

a wide variety of organisms. Of the 3399 reactions de-fined in the enzyme nomenclature system as definedby version 22.0 of the ENZYME database (Bairoch1999), 604 of these reactions occur in E. coli (18%),meaning that the remaining 301 reactions known tooccur in E. coli do not have assigned EC numbers. Fig-ure 3 shows the breakdown of E. coli reactions into thesix classes of the enzyme–nomenclature system.

CompoundsThe nodes in Figure 2 represent reaction substrates (de-fined as the union of the reactants and products of areaction). The node shapes encode the chemical classof each substrate, for example, squares represent car-bohydrates (for details, see key). Only a subset of allmetabolic substrates is shown in Figure 2, namely themain substrates that are shared among consecutivesteps of a pathway and that lie at the ends of a path-way. Figure 2 does not show the side compounds ofeach reaction (e.g., ADP/ATP in kinase reactions).

The 744 reactions of E. coli small-molecule me-tabolism involve a total of 791 different substrates. Onaverage, each reaction contains 4.0 substrates (Fig. 4).Each distinct substrate occurs in an average of 2.1 re-actions. The number of reactions in which the mostcommon substrates are found are shown in Table 1.

PathwaysAt present, EcoCyc describes 131 pathways, coveringenergy metabolism, nucleotide and amino acid biosyn-thesis, and secondary metabolism (Table 2). Pathwaysvary in length from a single reaction step to 16 steps(see Fig. 5), with an average of 5.4 steps. The eightsingle-reaction pathways represent interconversion re-actions for amino acid biosynthesis and catabolism;they are defined in EcoCyc for completeness. Becausethere is no precise biological definition of a pathway,the decision of how to define specific pathways withinthe database is not a simple one (Karp and Paley 1994).Although most EcoCyc pathways were extracted fromthe experimental literature, there is some latitude onhow to partition the metabolic network into pathways.Therefore, the distribution of pathway lengths is partlyinfluenced by these decisions.

Results Based on Combining Properties of theBasic EntitiesBelow we raise some biological questions and discusssome of the answers we have obtained by investigatingthe relationships between the properties of the basicentities discussed above.

DefinitionsThe analyses in this paper refer to enzymes and reac-tions of small-molecule metabolism. Although the listsof these enzymes and reactions may seem straightfor-ward to generate, the computational definitions of

Figure 1 Organization of protein complexes. The componentcount of a protein complex is the number of different geneswhose products are contained within the complex; the subunitcount of a protein complex is the number of monomers withinthe complex (the subunit count takes into account the coefficientof each monomer within the complex; the component countdoes not). (A) Distribution of subunit counts for all EcoCyc pro-tein complexes: The predominance of monomers, dimers, andtetramers is obvious—five enzymes are not included because thecoefficients of their components are not known. (B) Distributionof subunit counts as a function of component counts for all com-plexes. In all bar diagram figures presented herein, the actualcounts are shown; if the count exceeds the y-axis, it is indicatedby an open face.

E. col i Metabolic Map

Genome Research 569www.genome.org

Page 3: Global Properties of the Metabolic Map of Escherichia coli · global properties of the metabolic network of E. coli,as described by the EcoCyc database (Karp et al. 1999a). EcoCyc

Figure 2 (See facing page for legend.)

Ouzounis and Karp

570 Genome Researchwww.genome.org

Page 4: Global Properties of the Metabolic Map of Escherichia coli · global properties of the metabolic network of E. coli,as described by the EcoCyc database (Karp et al. 1999a). EcoCyc

these concepts are subtle because of the many relatedenzymes and reactions within the EcoCyc databasethat we wish to computationally exclude from thisdefinition. For example, we wish to exclude enzymesof macromolecule metabolism, as well as most reac-tions whose substrates are proteins (such as signal-transduction reactions), and include, for example, re-actions of fatty-acid biosynthesis whose substrates in-clude the acyl-carrier protein. The computationaldefinitions used for these concepts are as follows:

● A pathway, P, is a pathway of small-molecule me-tabolism if it is both (1) not a signal-transductionpathway, and (2) not a super-pathway (EcoCyc su-per-pathways are connected sets of smaller path-ways).

● A reaction, R, is a reaction of small-molecule me-tabolism if either (1) R is a member of a pathway ofsmall-molecule metabolism, or (2) none of the sub-strates of R are macromolecules such as proteins,tRNAs, or DNA, and R is not a transport reaction.

● An enzyme, E, is one of small-molecule metabolismif E catalyzes a reaction of small-molecule metabo-lism.

● A side compound is a reaction substrate that is not amain compound.

● A substrate is a reaction product or reactant.

Enzyme ModulationAn enzymatic reaction is a type of EcoCyc object thatrepresents the pairing of an enzyme with a reactioncatalyzed by that enzyme. Enzymatic reactions are nec-essary representational devices because the informa-tion they contain is specific to neither the individualenzyme nor to the individual reaction but to the pair-ing of the two (Karp and Riley 1993). These objectsencode information such as the activators and inhibi-tors of the enzyme, the cofactors required by the en-zyme, and alternative substrates that the enzyme willaccept—all with respect to a particular reaction.

EcoCyc contains extensive information on themodulation of E. coli enzymes. (A ligand that activatesand/or inhibits an enzyme by directly binding to it—orcompeting with its substrates—is defined as a modula-tor, in the absence of any biological term for this con-cept.) The information lists all compounds that areknown to activate or inhibit an enzyme from in vitroenzymological studies. The database also identifies, foreach enzyme, that subset of its activators and inhibi-tors known to be of physiological significance for E.coli.

Of the 805 enzymatic-reaction objects within Eco-Cyc, physiologically relevant activators are known for22, physiologically relevant inhibitors are known for80, and 17 have both. This means that it is very rare foran enzyme to have only an activator but not an inhibi-tor (five cases). The most frequent activators and in-hibitors are listed in Table 3A. For the enzymatic reac-

Figure 3 The number of EC class reactions present in E. coliagainst the total number of EC reaction types. The blue bars(%ecoli) signify the percent contribution of each class for allknown reactions in E. coli (the seven classes total 100%); thegreen bars (%total) signify the percent coverage of the EC classesin the known reactions in EcoCyc. Due to the apparently finerclassification of classes 1–3, the two measures display an inverserelationship: More reactions in E. coli belong to classes 1–3, al-though they represent a smaller percentage of reactions listed inthe EC hierarchy (and vice versa).

Figure 4 Diagram showing the number of reactions containingvarying numbers of substrates (reactants plus products).

Figure 2 Overview diagram of E. coli metabolism. Each node in the diagram represents a single metabolite whose chemical class isencoded by the shape of the node. Each blue line represents a single bioreaction. The white lines connect multiple occurrences of thesame metabolite in the diagram. (A) This version of the overview shows all interconnections between occurrences of the same metaboliteto communicate the complexity of the interconnections in the metabolic network. (B) In this version many of the metabolite intercon-nections have been removed to simplify the diagram; those reaction steps for which an enzyme that catalyzes the reaction is known tohave a physiologically relevant activator or inhibitor are highlighted.

E. col i Metabolic Map

Genome Research 571www.genome.org

Page 5: Global Properties of the Metabolic Map of Escherichia coli · global properties of the metabolic network of E. coli,as described by the EcoCyc database (Karp et al. 1999a). EcoCyc

tions described, it is known that 327 (almost half) re-quire a cofactor or prosthetic group. The mostcommon cofactors and prosthetic groups are listed inTable 3B.

Protein Subunits and Linked GenesA unique property of the EcoCyc knowledge base isthat it explicitly encodes the subunit organization ofproteins. Therefore, one can ask questions, such as Are

protein subunits coded by neighboring genes? Interest-ingly, this seems to be the case for >90% of knownheteromeric enzymes (data not shown). For instance,genes for imidazole glycerol phosphate synthase sub-units HisH and HisF are located at centisome positions45.053 and 45.081, respectively. An example of a non-neighboring gene pair for subunits for a heteromericenzyme is the pair coding for ArgF and ArgI, subunitsof ornithine carbamoyltransferase, located at centi-some positions 6.356 and 96.421, respectively.

Reactions Catalyzed by More Than One EnzymeSome reactions are catalyzed by more than one en-zyme: Although 592 reactions are catalyzed by a singleenzyme, 55 reactions are catalyzed by two enzymesand 12 reactions are catalyzed by three enzymes (Fig.6). For 84 reactions, the corresponding enzyme is notrecorded in EcoCyc. There is a single E. coli reactioncatalyzed by four enzymes (second part of the mul-tistep synthesis of thiazole moiety of thiamine). Thereasons for this isozyme redundancy can be attributedto two factors: Either the enzymes that catalyze thesame reaction are homologs and have duplicated (orobtained by horizontal gene transfer), acquiring somespecificity but retaining the same mechanism (diver-gence), or the reaction is easily “invented”; therefore,there is more than one protein family that is indepen-dently able to perform the catalysis (convergence). La-bedan and Riley (1995) examined 73 pairs of E. coliisozymes and found that 60% of those pairs had de-tectable sequence similarity.

Enzymes That Catalyze More Than One ReactionVirtually every enzyme whose function has been pre-dicted in a complete genome has been assigned a singleenzymatic function. However, E. coli is known to con-tain many multifunctional enzymes. Of the total of607 E. coli enzymes, 100 are multifunctional, eitherhaving the same active site and different substratespecificities or different active sites. The enzymes thatcatalyze seven and nine reactions (Fig. 7) are, respec-tively, purine nucleoside phosphorylase and nucleo-side diphosphate kinase. The significantly high pro-portion of multifunctional enzymes implies that thegenome projects are significantly underpredictingmultifunctional proteins.

Reactions Participating in More Than One PathwayMetabolic networks are difficult to represent in bio-chemistry textbooks, because their complex relation-ships must be laid out on a two-dimensional chart. Inaddition, the interactions between pathways are suchthat these graphs are difficult to draw clearly. It is ofinterest to determine the common intersections of anapparently loosely connected network of metabolic re-actions.

In EcoCyc, 483 reactions belong to a single path-

Table 1. Most Frequently Used Metabolites in E. coliCentral Metabolism

Occurrence Name of metabolite

205 H2O152 ATP101 ADP100 phosphate89 pyrophosphate66 NAD60 NADH54 CO253 H+

49 AMP48 NH348 NADP45 NADPH44 Coenzyme A43 L-glutamate41 pyruvate29 acetyl-CoA26 O224 2-oxoglutarate23 S-adenosyl-L-methionine18 S-adenosyl-homocysteine16 L-aspartate16 L-glutamine15 H2O214 glucose13 glyceraldehyde-3-phosphate13 THF13 acetate12 PRPP12 [acyl carrier protein]12 oxaloacetic acid11 dihydroxy-acetone-phosphate11 GDP11 glucose-1-phosphate11 UMP10 e1

10 phosphoenolpyruvate10 acceptor10 reduced acceptor10 GTP10 L-serine10 fructose-6-phosphate9 L-cysteine9 reduced thioredoxin9 oxidized thioredoxin9 reduced glutathione8 acyl-ACP8 L-glycine8 GMP8 formate

Metabolites were used either as reactants or products.

Ouzounis and Karp

572 Genome Researchwww.genome.org

Page 6: Global Properties of the Metabolic Map of Escherichia coli · global properties of the metabolic network of E. coli,as described by the EcoCyc database (Karp et al. 1999a). EcoCyc

Table 2. List of All Known E. coli Metabolic Pathways as Described by EcoCyc

(Deoxy)ribose phosphate metabolism3-Phenylpropionate and 3-(3-hydroxyphenyl)propionate degradation4-Aminobutyrate degradationAerobic electron transferAerobic respiration, electron donors reaction listAlanine biosynthesisAnaerobic electron transferAnaerobic respirationAnaerobic respiration, electron acceptors reaction listAnaerobic respiration, electron donors reaction listArginine biosynthesisAsparagine biosynthesis and degradationAspartate biosynthesis and degradationBetaine biosynthesisBiosynthesis of proto- and sirohemeBiotin biosynthesisCarnitine metabolismCarnitine metabolism, CoA-linkedCobalamin biosynthesisColanic acid biosynthesisCyanate catabolismCysteine biosynthesisD-arabinose catabolismD-galactarate catabolismD-galacturonate catabolismD-glucarate catabolismD-glucuronate catabolismDegradation of short-chain fatty acidsDeoxypyrimidine nucleotide/side metabolismDeoxyribonucleotide metabolismdTDP–rhamnose biosynthesisEnterobacterial common antigen biosynthesisEnterobactin synthesisEntner–Doudoroff pathwayFatty acid biosynthesis, initial stepsFatty acid elongation, saturatedFatty acid elongation, unsaturatedFatty acid oxidation pathwayFermentationFolic acid biosynthesisFormylTHF biosynthesisFucose catabolismGalactitol catabolismGalactonate catabolismGalactose metabolismGalactose, galactoside and glucose catabolismGluconeogenesisGlucosamine catabolismGlucose 1-phosphate metabolismGlutamate biosynthesisGlutamate utilizationGlutamine biosynthesisGlutamine utilizationGlutathione biosynthesisGlutathione-glutaredoxin redox reactionsGlycerol metabolismGlycine biosynthesisGlycine cleavageGlycogen biosynthesisGlycogen catabolismGlycolate metabolismGlycolysisGlyoxylate cycleGlyoxylate degradationHistidine biosynthesisHistidine degradation

Isoleucine biosynthesisKDO biosynthesisL-alanine degradationL-arabinose catabolismL-cysteine catabolismL-lyxose metabolismL-serine degradationLactose degradationLeucine biosynthesisLipid A precursor biosynthesisLysine and diaminopimelate biosynthesisMannitol degradationMannose and GDP-mannose metabolismMannose catabolismMenaquinone biosynthesisMethionine biosynthesisMethyl-donor molecule biosynthesisMethylglyoxal metabolismNAD phosphorylation and dephosphorylationNonoxidative branch of the pentose phosphate pathwayNucleotide metabolismO-antigen biosynthesisOxidative branch of the pentose phosphate pathwayPantothenate and coenzyme A biosynthesisPeptidoglycan biosynthesisPhenylalanine biosynthesisPhenylethylamine degradationPhosphatidic acid synthesisPhospholipid biosynthesisPolyamine biosynthesisPolyisoprenoid biosynthesisppGpp metabolismProline biosynthesisProline utilizationPropionate metabolism, methylmalonyl pathwayPurine biosynthesisPyridine nucleotide cyclingPyridine nucleotide synthesisPyridoxal 58-phosphate biosynthesisPyridoxal 58-phosphate salvage pathwayPyrimidine biosynthesisPyrimidine ribonucleotide/ribonucleoside metabolismPyruvate dehydrogenasePyruvate oxidation pathwayRemoval of superoxide radicalsRhamnose catabolismRiboflavin, FMN and FAD biosynthesisRibose catabolismSerine biosynthesisSorbitol degradationSulfate assimilation pathwayTCA cycle, aerobic respirationThiamine biosynthesisThioredoxin pathwayThreonine biosynthesisThreonine catabolismTrehalose biosynthesisTrehalose degradation, low osmolarityTryptophan biosynthesisTryptophan utilizationTyrosine biosynthesisUbiquinone biosynthesisUDP-N-acetylglucosamine biosynthesisValine biosynthesisXylose catabolism

The reactions and enzymes within each pathway can be determined using the EcoCyc WWW server that is available athttp://ecocyc.DoubleTwist.com/ecocyc/.

E. col i Metabolic Map

Genome Research 573www.genome.org

Page 7: Global Properties of the Metabolic Map of Escherichia coli · global properties of the metabolic network of E. coli,as described by the EcoCyc database (Karp et al. 1999a). EcoCyc

way, whereas 99 reactions belong to more than onepathway (Fig. 8). These reactions appear to be the in-tersection points in the complex network of chemicalprocesses in the cell. For example, the reaction presentin six pathways corresponds to the reaction catalyzedby malate dehydrogenase (EC 1.1.1.37), a central en-zyme in cellular metabolism (Goward and Nicholls1994). Another pathway property is the number ofunique substrates per pathway (Fig. 9).

DISCUSSIONThe purpose of this study is to raise questions as well asto answer them. The questions raised are the sort thatshould be posed to the creators of the many full ge-nomic sequences now available for integration of ge-nomic data to obtain systems-level understandings ofthese organisms. Ideally, we would like to have a com-plex model of full metabolic networks that predictsand explains many of the relationships presentedherein. Because no such underlying model exists, wepresent an empirical analysis as a first step toward sucha model.

One result of this study is the identification of anumber of measures for quantifying a metabolic net-work. We sought measures that were well defined, rela-tively easy to compute, and comprehensive. However,we expect that additional important quantities will beidentified in future studies.

Given that many E. coli genes or enzymes are notyet characterized, our results probably represent an un-derestimate of the real metabolic network. As the an-notation of the E. coli genome within EcoCyc ap-proaches completeness, we expect relatively smallchanges (<10%) in the patterns and generalizationspresented here because (1) the experimental informa-tion that already exists for E. coli metabolism is quiteextensive, and (2) although 30% of E. coli genes remain

Table 3. Most Common Modulators, cofactors, and prosthetic groups of E. coli enzymes and Their Frequencies

A. Modulators (activators and inhibitors) B. Cofactors and prosthetic groups

OccurrenceName

of modulator Activator Inhibitor OccurrenceName

of compound CofactorProsthetic

group

35 Cu2+ • 145 Mg2+ • •32 ATP • • 48 pyridoxal 58-phosphate • •30 Zn2+ • • 33 Mn2+ •29 AMP • • 31 FAD • •26 ADP • • 21 Fe2+ • •25 EDTA • • 18 Zn2+ • •23 p-chloromercuribenzoate • 16 thiamine-pyrophosphate •23 pyrophosphate • • 11 FMN • •22 K+ • • 10 Co2+ •22 phosphate • • 9 K+ •20 Hg2+ • 6 Mo2+ •20 Ca2+ • • 5 NAD • •19 N-ethylmaleimide • • 4 protoheme •16 NAD • • 4 Ni2+ • •16 iodoacetamide • 4 Ca2+ •16 coenzyme A • 4 4Fe-4S center •15 Co2+ • • 3 NH4

+ •15 Mg2+ • • 3 pyruvate •15 phosphoenolpyruvate • • 3 siroheme •14 Fe2+ • • 3 cytochrome c •14 GTP • • 2 heme C •14 pyruvate • • 2 B12 •13 p-hydroxymercuribenzoate • 2 NADP •13 NADP • 2 Cu2+ •12 Mn2+ • • 2 biotin •

2 Cd2+ •

Figure 5 Length distribution of EcoCyc pathways; two pathwaysare not included because the number of steps is not known.

Ouzounis and Karp

574 Genome Researchwww.genome.org

Page 8: Global Properties of the Metabolic Map of Escherichia coli · global properties of the metabolic network of E. coli,as described by the EcoCyc database (Karp et al. 1999a). EcoCyc

unidentified, enzymes are the best studied and easilyidentifiable class of proteins. Therefore, we expect rela-tively few new enzymes to be discovered among thesegenes. Because functional information such as path-ways and reactions is much harder to estimate fromgenomic sequence, our reasoning on coverage is domi-nated by the actual numbers of genes and proteinsknown to be present in E. coli.

How will these patterns generalize to other spe-cies? In future work we plan to apply these same mea-sures to full metabolic networks that have been pre-dicted from genome sequences using the PathoLogicprogram (Karp et al. 1996). However, one measure thatwe can estimate now is the frequency of multifunc-tional enzymes within the network. E. coli contains100 multifunctional enzymes (16% of all enzymes).However, the annotators of the full microbial genomes

sequenced to date have identified virtually no multi-functional proteins, suggesting that either E. coli is un-usual in this respect or that the annotations have sys-tematically neglected this aspect of protein function.

This study illustrates the power of the high-fidelityrepresentation of biological function used by the Eco-Cyc database (Karp and Riley 1993). These resultswould be impossible to derive using existing protein ornucleic acid sequence databases because these data-bases represent protein functions using text descrip-tions that cannot be dissected reliably by computerprograms to compute the measures employed herein.For example, because the sequence databases do notencode the subunit organization of a protein precisely,they cannot be used to answer queries, such as Aresubunits of multicomponent protein complexes linkedin the genome?

As we quickly progress into the genome era of ex-perimental and computational biology, systems suchas EcoCyc should prove extremely useful in analyzing

Figure 6 Diagram showing the number of reactions that arecatalyzed by one or more enzymes. Most reactions are catalyzedby one enzyme, some by two, and very few by more than twoenzymes.

Figure 7 Diagram showing the number of enzymes that cata-lyze one or more reactions. Most enzymes catalyze one reaction;some are multifunctional.

Figure 8 Diagram showing the number of reactions that par-ticipate in one or more pathways.

Figure 9 The distribution of the number of substrates in Eco-Cyc pathways. For a definition of a substrate, please see the text(Definitions).

E. col i Metabolic Map

Genome Research 575www.genome.org

Page 9: Global Properties of the Metabolic Map of Escherichia coli · global properties of the metabolic network of E. coli,as described by the EcoCyc database (Karp et al. 1999a). EcoCyc

the functional capacities of model organisms, recon-structing their metabolic pathways from the genomesequence (Karp et al. 1999b), as well as understandingtheir cellular processes.

METHODSAll data analysis was performed using version 4.5 of the Eco-Cyc database, released in September, 1998 (Karp et al. 1999a).Analysis programs were written in Allegro Common Lisp(Franz, Inc.) on a SUN SparcStation, running Solaris 2.x. Theseprograms queried the EcoCyc Knowledge Base as storedwithin the Ocelot frame knowledge representation system,which is also implemented in Allegro Common Lisp.

ACKNOWLEDGMENTSWe thank Dr. Jonathan Wagg (Pangea Systems), Monica Riley(Marine Biological Laboratory), and numerous colleagues atthe European Bioinformatics Institute for comments. Thiswork was supported by grant 1-R01-RR07861-01 from the NIHNational Center for Research Resources. Additional supportwas provided by the Human Frontier Science Program Orga-nization and the European Molecular Biology Laboratory.

The publication costs of this article were defrayed in partby payment of page charges. This article must therefore behereby marked “advertisement” in accordance with 18 USCsection 1734 solely to indicate this fact.

REFERENCESBairoch, A. 1999. The ENZYME data bank in 1999. Nucleic Acids Res.

27: 310–311.Blattner, F.R., G. Plunket, C.A. Bloch, N.T. Perna, V. Burland, M.

Riley, J. Collado-Vides, J.D. Glasner, C.K. Rode, G.F. Mayhew etal. 1997. The complete genome sequence of Escherichia coli K-12.Science 277: 1453–1474.

Goward, C.R. and D.J. Nicholls. 1994. Malate dehydrogenase: Amodel for structure, evolution, and catalysis. Protein Sci. 3:1883–1888.

Karp, P.D. and M. Riley. 1993. Representations of metabolicknowledge. Intell. Syst. Mol. Biol. 1: 207–215.

Karp, P.D. and S.M. Paley. 1994. Representations of metabolicknowledge: Pathways. Intell. Syst. Mol. Biol. 2: 203–211.

Karp, P.D., C. Ouzounis, and S. Paley. 1996. HinCyc: A knowledgebase of the complete genome and metabolic pathways ofHaemophilus influenzae. Intell. Syst. Mol. Biol. 4: 116–124.

Karp, P.D., M. Riley, S.M. Paley, A. Pellegrini-Toole, and M.Krummenacker. 1999a. EcoCyc: Encyclopedia of Escherichia coligenes and metabolism. Nucleic Acids Res. 27: 55–58.

Karp, P.D., M. Krummenacker, S.M. Paley, and J. Wagg. 1999b.Integrated pathway-genome databases and their role in drugdiscovery. Trends Biotechnol. 7: 275–281.

Labedan, B. and M. Riley. 1995. Gene products of Escherichia coli:Sequence comparisons and common ancestries. Mol. Biol. Evol.12: 980–987.

Received October 13, 1999; accepted in revised form February 15, 2000.

Ouzounis and Karp

576 Genome Researchwww.genome.org