

Theoretical Population Biology 61, 367–390 (2002)

doi:10.1006/tpbi.2002.1606

Heterogeneity of Genome and Proteome Content in Bacteria, Archaea, and Eukaryotes

Samuel Karlin¹, Luciano Brocchieri¹, Jonathan Trent², B. Edwin Blaisdell¹, and Jan Mrázek¹

¹Department of Mathematics, Stanford University, Stanford, California 94305-2125
²NASA Ames Research Center, Mail Stop 239-15, Moffett Field, California 94035

E-mail: karlin@math.stanford.edu

Received April 7, 2002

Our analysis compares bacteria, archaea, and eukaryota with respect to a wide assortment of genome and proteome properties. These properties include ribosomal protein gene distributions, chaperone protein contrasts, major variation of transcription/translation factors, genes encoding pathways of energy metabolism, and predicted protein expression levels. Significant differences within and between the three domains of life include protein lengths, information processing procedures, many metabolic and lipid biosynthesis pathways, cellular controls, and regulatory proteins. Differences among genomes are influenced by lifestyle, habitat, physiology, energy sources, and other factors. © 2002 Elsevier Science (USA)

1. INTRODUCTION

Based primarily on rRNA sequence criteria, life has been broadly divided into three domains referred to as Eukaryota, Bacteria, and Archaea, which are believed to reflect phylogenetic relationships (Woese et al., 1990; Brown and Doolittle, 1997). This classification system, however, has not gone uncontested, and many problems concerning inferences of evolutionary relationships from molecular sequence data have been identified (e.g., for different views, see Gupta, 1998, 2000; Poole et al., 1999; Lopez et al., 1999; Forterre and Philippe, 1999; Brocchieri, 2001; Gribaldo and Philippe, 2002). Furthermore, the identity of the three domains and the relationships between them are problematic. For example, the archaea and eukaryota putatively share many genes and proteins involved in information and cellular activity, whereas archaea and bacteria share many morphological structures and “operational” metabolic genes and proteins (Rivera et al., 1998). Other features make the phylogenetic cohesiveness of the three domains uncertain and their classification moot. To investigate the rigor and value of the current classification system, it is, therefore, of interest to catalog important genes and proteins in each domain and to show strong similarities and differences within and between domains.

In this post-genomic era we can, in principle, study the complete inventory of cellular proteins. Five eukaryotic genome sequences are now complete or nearly complete: Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, Arabidopsis thaliana, and Homo sapiens. In addition, more than 50 prokaryotic genome sequences are complete. These include human pathogens and microbes of industrial and commercial value. Our comparisons between the eukaryota, archaea, and bacteria can provide insights into the workings of living cells by illuminating protein properties associated with specific structures and functions that will not only have evolutionary consequences, but also medical and environmental benefits.

0040-5809/02 $35.00 © 2002 Elsevier Science (USA). All rights reserved.

2. GENOMIC FEATURES AND ORGANIZATION

It is widely believed that in most prokaryotic organisms during exponential growth, ribosomal proteins (RP), transcription/translation processing factors (TF), and the major chaperone/degradation (CH) genes functioning in protein folding and trafficking are highly expressed. This view is based on two-dimensional polyacrylamide gel electrophoresis and mass spectrometry measurements for ECOLI (for genome species abbreviations, see Table I; VanBogelen et al., 1996, 1999), for BACSU (see Hecker and collaborators, e.g., Antelmann et al., 1997), for METJA (Giometti, Argonne National Labs, pers. comm.), and for DEIRA (Lipton et al., 2002). The three gene classes RP, CH, and TF record similar codon frequencies, which show high codon biases relative to the average gene, and the codon usage differences among these three gene classes are low (Karlin and Mrázek, 2000). Using these genes as a basis, a gene is Predicted Highly eXpressed (PHX) if its codon usage is rather similar to that of the RP, TF, and CH gene classes and deviates strongly from that of the average gene of the genome. By these criteria, PHX genes in most prokaryotic genomes include genes of principal energy metabolism and often genes of amino acid, nucleotide, and fatty acid biosyntheses. In the cyanobacterium SYNY3, the primary genes of photosynthesis are PHX, and in methanogens those essential for methanogenesis are PHX.

TABLE I
List of Species and Abbreviations

Genome                                    Abbreviation(a)

Bacteria
Escherichia coli K12                      ECOLI (Eco)
Escherichia coli O157                     - (EcZ)
Salmonella typhimurium                    SALTY
Pasteurella multocida                     PASMU
Yersinia pestis                           YERPE
Vibrio cholerae                           VIBCH (Vch)
Xylella fastidiosa                        XYLFA
Haemophilus influenzae                    HAEIN (Hin)
Buchnera sp. APS                          BUCSP
Pseudomonas aeruginosa                    PSEAE (Pae)
Neisseria meningitidis                    NEIME (Nme)
Agrobacterium tumefaciens                 AGRTU
Mesorhizobium loti                        MESLO (Mlo)
Sinorhizobium meliloti                    SINME
Caulobacter crescentus                    CAUCR (Ccr)
Rhodobacter sphaeroides                   RHOSP
Rickettsia prowazekii                     RICPR (Rpr)
Helicobacter pylori                       HELPY (Hpy)
Campylobacter jejuni                      CAMJE (Cje)
Bacillus subtilis                         BACSU (Bsu)
Bacillus halodurans                       BACHA
Staphylococcus aureus                     STAAU
Lactococcus lactis                        LACLA (Lla)
Streptococcus pyogenes                    STRPY (Spy)
Streptococcus pneumoniae                  STRPN
Listeria monocytogenes                    LISMO
Clostridium acetobutylicum                CLOAC
Mycobacterium leprae                      MYCLE (Mle)
Mycobacterium tuberculosis                MYCTU (Mtu)
Mycoplasma genitalium                     MYCGE (Mge)
Mycoplasma pneumoniae                     MYCPN
Ureaplasma urealyticum                    UREUR (Uur)
Synechocystis sp.                         SYNY3 (Syn)
Deinococcus radiodurans                   DEIRA (Dra)
Treponema pallidum                        TREPA (Tpa)
Borrelia burgdorferi                      BORBU (Bbu)
Chlamydia trachomatis                     CHLTR (Ctr)
Chlamydia pneumoniae                      CHLPN (Cpn)
Chlamydia muridarum                       CHLMU
Aquifex aeolicus                          AQUAE (Aae)
Thermotoga maritima                       THEMA (Tma)

Archaea
Halobacterium sp.                         HALSP (Hbs)
Methanococcus jannaschii                  METJA (Mja)
Methanococcus voltae                      METVO
Methanosarcina barkeri                    METBA
Methanobacterium thermoautotrophicum      METTH (Mth)
Archaeoglobus fulgidus                    ARCFU (Afu)
Pyrococcus horikoshii                     PYRHO (Pho)
Pyrococcus abyssi                         PYRAB (Pab)
Thermoplasma acidophilum                  THEAC (Tac)
Thermoplasma volcanium                    THEVO
Pyrobaculum aerophilum                    PYRAE
Aeropyrum pernix                          AERPE (Ape)
Sulfolobus solfataricus                   SULSO
Sulfolobus tokodaii                       SULTO

Eukaryotes
Homo sapiens                              HUMAN
Drosophila melanogaster                   DROME
Caenorhabditis elegans                    CAEEL
Saccharomyces cerevisiae                  YEAST
Arabidopsis thaliana                      ARATH
Trichomonas vaginalis                     TRIVA
Giardia lamblia                           GIALA
Entamoeba histolytica                     ENTHI

(a) The SwissProt five-letter abbreviations are used in the text. The three-letter abbreviations are used in some tables.
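The PHX idea described in Section 2 (codon usage close to the RP, TF, and CH classes and far from the average gene) can be illustrated with a small sketch. This is not the authors' exact measure: the plain sum of absolute codon-frequency differences, the single pooled reference set, and the function names `codon_frequencies`, `usage_difference`, and `expression_score` are all simplifying assumptions made here for illustration.

```python
from collections import Counter


def codon_frequencies(seqs):
    """Pooled codon frequencies over a list of in-frame coding sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        # Step through complete codons only; a trailing partial codon is ignored.
        for i in range(0, len(seq) - 2, 3):
            counts[seq[i:i + 3]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}


def usage_difference(f, g):
    """Sum of absolute codon-frequency differences (an assumed, simplified distance)."""
    return sum(abs(f.get(c, 0.0) - g.get(c, 0.0)) for c in set(f) | set(g))


def expression_score(gene, reference_genes, all_genes):
    """Ratio of the gene's distance from the average gene to its distance
    from the reference (RP/TF/CH-like) set. Values well above 1 mean the
    gene resembles the reference set more than it resembles the average gene."""
    gene_f = codon_frequencies([gene])
    d_all = usage_difference(gene_f, codon_frequencies(all_genes))
    d_ref = usage_difference(gene_f, codon_frequencies(reference_genes))
    return float("inf") if d_ref == 0 else d_all / d_ref


# Toy data (hypothetical sequences, alanine codons only).
reference = ["GCTGCTGCT"]
genome = ["GCTGCTGCT", "GCGGCGGCG", "GCGGCGGCG"]
high = expression_score("GCTGCTGCA", reference, genome)
low = expression_score("GCGGCGGCG", reference, genome)
```

In this toy case the first gene scores above 1 and the second below 1; any cutoff for calling a gene PHX would have to be chosen separately.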

2.1. Shine–Dalgarno (SD) Sequences and PHX Genes

In prokaryotes, we find a correlation between gene expression levels and the presence of a Shine–Dalgarno (SD) sequence, which plays an important role in translation initiation (see below). In bacterial cells, translation initiation is generally considered the rate-limiting step of translation (reviewed in Gold, 1988; Draper, 1996). Initiation of gene translation in many bacteria involves interactions between a conserved SD sequence upstream of the start codon in the mRNA and an equally conserved anti-SD sequence at the 3′ end of the 16S rRNA. Not all mRNAs possess a recognizable SD sequence, however. The consensus SD sequence features at its core the purine run AGGAGG or GGAGGA, generally traversing positions −10 to −5 relative to the start codon, and the 16S rRNA gene mainly carries the anti-SD sequence CACCTCCTTTC at its 3′ end (see Ma et al., 2002 for details). We observed that the majority of prokaryotic genomes have at least one copy of the 16S rDNA gene that has the CCTCCT terminal motif.

For our purposes, a strong SD sequence refers to the motif GGAG, GAGG, or sometimes AGGA, aligned within the positions −10 to −5 from the gene start codon. In several genomes, the proportions of PHX genes and of average to low expression level genes with a strong SD sequence have been investigated (Ma et al., 2002). As may be expected, PHX genes are significantly more likely than genes with an average or low expression level to possess a strong SD motif. This positive correlation between strong SD signal sequences and highly expressed genes can be found in almost all bacterial and archaeal genomes, whereas SD sequences do not exist in eukaryotes.
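The strong-SD criterion above can be sketched in a few lines. The sketch assumes the input is the sequence immediately upstream of the start codon (its last base is position −1) and that "aligned within positions −10 to −5" means the four-base motif lies entirely inside that six-base window; both the interpretation and the function name `has_strong_sd` are assumptions for illustration.

```python
STRONG_SD_MOTIFS = ("GGAG", "GAGG", "AGGA")


def has_strong_sd(upstream):
    """Return True if a strong SD motif lies within positions -10..-5.

    `upstream` is the sequence 5' of the start codon, ending at position -1.
    """
    upstream = upstream.upper().replace("U", "T")  # accept RNA or DNA letters
    if len(upstream) < 10:
        return False
    window = upstream[-10:-4]  # the six bases at positions -10 .. -5
    return any(motif in window for motif in STRONG_SD_MOTIFS)


with_sd = has_strong_sd("AAAAAGGAGGTTTT")     # AGGAGG occupies -10..-5
without_sd = has_strong_sd("ACGTACGTACGTAC")  # no purine run in the window
misplaced = has_strong_sd("GGAGAAAAAAAAAA")   # motif lies outside the window
```

Tallying this flag separately for PHX genes and for the remaining genes gives the two proportions compared in the text.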

2.2. Unique vs Multiple Origins of Replication

The GC skew (bias in $(G - C)/(G + C)$ counts) shows a strong difference between archaea and bacteria, probably due to the existence of unique vs multiple origins of replication (Mrázek and Karlin, 1998; Frank and Lobry, 1999). There is substantial evidence for a prevalence of G in excess of C in the leading strand relative to the lagging strand in most bacterial genomes. Exceptions include the genomes of SYNY3, DEIRA, THEMA, and all of the archaeal genomes. Eukaryotic chromosomes, including those of YEAST, CAEEL, DROME, HUMAN, and ARATH, show no strand asymmetry.

What are the possible sources of strand composition asymmetry? Lobry (1996a, b) was the first to observe strand compositional asymmetry, which he associated with differences in replication, mutation, and repair biases in the leading vs the lagging strand. Francino and Ochman (1997) emphasize a mutational bias associated with transcription-coupled repair mechanisms and deamination events. Other possible sources of strand asymmetry include enzyme and architectural asymmetries at the replication fork, differences of replication processivity (Kunkel, 1992), intergenic differences in signal or binding sites in the two strands, differences in gene density coupled with amino acid and codon biases (Mrázek and Karlin, 1998), and dNTP-pool fluctuations during the cell cycle (Thomas et al., 1996). Rocha et al. (1999), using a statistical linear discriminant function, observed compositional asymmetries between genes on the leading vs those on the lagging strand at the level of nucleotides, codons, and amino acids. The GC skew switches sign at the origin and terminus of replication in those bacteria possessing a single origin of replication (oriC) that is subject to bidirectional replication. Other factors that may contribute to strand asymmetry include gene function or expression level, operon organization, and differences in single-base or context-dependent mutations.

Strand compositional asymmetry, in general, is not apparent in the genomes of organisms known to possess multiple origins of replication distributed, on average, every 50 kb. It may, therefore, be surmised that archaeal genomes, which apparently do not show compositional asymmetry (no GC skew bias), possess multiple replication origins.
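A minimal sketch of the GC skew computation discussed in this section follows. The non-overlapping window scheme, the default window size, and the names `gc_skew` and `skew_profile` are assumptions chosen for illustration; the cited studies may use different windowing.

```python
def gc_skew(window):
    """(G - C) / (G + C) for one window; 0.0 if the window has no G or C."""
    g = window.count("G")
    c = window.count("C")
    return 0.0 if g + c == 0 else (g - c) / (g + c)


def skew_profile(seq, window=50_000, step=None):
    """GC skew along a sequence in windows (non-overlapping by default)."""
    seq = seq.upper()
    step = step or window
    return [gc_skew(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]


# Toy sequence: a G-rich half followed by a C-rich half, so the skew flips sign.
profile = skew_profile("GGGCGGGC" + "CCCGCCCG", window=8)
```

In a genome with a single bidirectional origin, a profile of this kind would be expected to change sign near the origin and the terminus; in the absence of strand asymmetry it would fluctuate around zero.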

2.3. Periodic 30 bp Repeats in Archaea, Thermophiles, and Alkaliphiles

All current archaeal genomes, except Halobacterium sp., contain one or more unusual clusters of 24–32 bp repeat elements, usually in excess of 40 copies, separated by 40–60 bp of nonconserved spacers (see Table II). A similar repeat arrangement is present in the Gram-negative hyperthermophilic bacteria AQUAE and THEMA. The function of these repeats is unknown, although it is tempting to speculate that it is related to the thermophilic lifestyle. However, a corresponding array of repeats is also observed in the genome of BACHA, a mesophilic bacterium characterized as an extreme alkaliphile living optimally at pH ≥ 9.5. Two current mesophilic methanogens, Methanosarcina mazei and Methanosarcina acetivorans, also contain repeats of the kind displayed in Table II.

TABLE II
Periodic 24–32 bp Repeats in Prokaryotic Genomes

                                                              Repeat count in the genome
Genome  Repeat sequence(s)                Periodicity (bp)   Exact   Allowing N errors   Number of clusters   Max. number of repeats in a single cluster
SULSO   CTTTCAATTCCTTTTGGGATTAATC         61–66              151     201 (N = 5)         2                    103
        CTTTCAATTCTATAAGAGATTATC          61–66              127     224 (N = 5)         4                    96
SULTO   TCTTTCAATTCCTTTTGGGATTCATC        62–66              44      188 (N = 5)         2                    113
        ACTTTCAATTCCATTAAGGATTATC         63–67              57      271 (N = 5)         4                    96
PYRAE   GTTTCAACTATCTTTTGATTTCTGG         65–70              43      45 (N = 5)          3                    18
        GAATCTTCGAGATAGAATTGCAAG          66–69              81      83 (N = 5)          1                    81
AERPE   GCATATCCCTAAAGGGAATAGAAAG         63–67              42      42 (N = 5)          1                    42
        GAATCTTCGAGATAGAATTGCAAG          62–69              36      47 (N = 5)          2                    25
METJA   RTTAAAATCAGACCGTTTCGGAATGGAAAY    65–73              66      134 (N = 6)         9                    25
METTH   GTTAAAATCAGACCAAAATGGGATTGAAAT    65–68              171     171 (N = 6)         2                    124
ARCFU   GTTGAAATCAGACCAAAATGGGATTGAAAG    66–69              107     108 (N = 6)         2                    60
PYRHO   GTTTCCGTAGAACTAAATAGTGTGGAAAG     65–68              71      111 (N = 6)         3                    67
PYRAB   GTTCCAATAAGACTAAAATAGAATTGAAAG    67                 47      53 (N = 6)          3                    27
PYRFU   GTTCCAATAAGACTAAAATAGAATTGAAAG    66–68              50      206 (N = 6)         7                    50
THEAC   GTAAAATAGAACCTTAATAGGATTGAAAG     65–66              46      47 (N = 6)          1                    47
THEVO   GTTTAAGATGTACTAGTTAGTATGGAAG      ~70                33      40 (N = 6)          2                    19
THEMA   GTTTCAATAMTTCCTTAGAGGTATGGAAAC    65–67              100     113 (N = 6)         8                    41
AQUAE   GTTCCTAATGTACCGTGTGGAGTTGAAAC     65–67              14      25 (N = 6)          5                    5
BACHA   GTCGCACTCTTCATGGGTGCGTGGATTGAAAT  65–68              49      95 (N = 6)          5                    36
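The repeat counts of the kind tabulated in Table II ("exact" and "allowing N errors") together with the periodicity can be sketched as below. Treating errors as mismatches only (Hamming distance, no insertions or deletions) is an assumption, as are the function names; the brute-force scan is meant for illustration rather than for whole genomes.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_repeat_hits(genome, repeat, max_errors=0):
    """Start positions where `repeat` occurs with at most `max_errors` mismatches."""
    k = len(repeat)
    return [i for i in range(len(genome) - k + 1)
            if hamming(genome[i:i + k], repeat) <= max_errors]


def spacings(hits):
    """Distances between consecutive hit starts (the periodicity within a cluster)."""
    return [b - a for a, b in zip(hits, hits[1:])]


# Toy cluster (hypothetical sequence): three 8 bp repeat copies, the middle one
# carrying a single mismatch, separated by 10 bp spacers.
repeat = "GTTTCAAT"
spacer = "ACACACACAC"
genome = repeat + spacer + "GTTTCAAC" + spacer + repeat

exact_hits = find_repeat_hits(genome, repeat)
loose_hits = find_repeat_hits(genome, repeat, max_errors=1)
periods = spacings(loose_hits)
```

Here the period (18) equals repeat length plus spacer length, in the same way that 24–32 bp repeats with 40–60 bp spacers would give periodicities in the 60s.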

2.4. Representations of Short Palindromes

Archaeal and bacterial genomes tend to underrepresent 4 and 6 bp palindromes (Rocha et al., 2001; see also Karlin et al., 1992), whereas eukaryal genomes have many of these short palindromes. This observation is consistent with the presence of restriction systems in prokaryotic genomes but not in eukaryotes.
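As a rough illustration of what "underrepresentation" of short palindromes means, the sketch below enumerates reverse-complement palindromes and compares an observed count with a zero-order expectation based on base frequencies alone. That expectation model is an assumption made here for simplicity; the cited studies may use more refined (e.g., Markov-based) expectations, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    return word.translate(COMPLEMENT)[::-1]


def is_palindrome(word):
    """A DNA palindrome reads the same as its reverse complement."""
    return word == reverse_complement(word)


def palindromes(k):
    """All reverse-complement palindromes of length k."""
    words = ("".join(p) for p in product("ACGT", repeat=k))
    return [w for w in words if is_palindrome(w)]


def observed_over_expected(seq, word):
    """Observed count of `word` divided by its zero-order expected count."""
    k = len(word)
    positions = len(seq) - k + 1
    observed = sum(1 for i in range(positions) if seq[i:i + k] == word)
    base_freq = Counter(seq)
    expected = positions
    for base in word:
        expected *= base_freq[base] / len(seq)
    return 0.0 if expected == 0 else observed / expected


n4 = len(palindromes(4))
n6 = len(palindromes(6))
ratio = observed_over_expected("ACGT" * 5, "ACGT")
```

A ratio well below 1 for a given palindrome in a real genome would indicate avoidance of that site under this simple model.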

2.5. Genome Signature Profiles

Early biochemical experiments measuring nearest-neighbor frequencies established that the set of dinucleotide relative abundances (the so-called dinucleotide bias) is a remarkably stable property of the DNA of an organism (Josse et al., 1961; Russell et al., 1976). We observed that the dinucleotide bias appears to reflect species-specific properties of DNA stacking energies, DNA modification, replication, and repair mechanisms. Dinucleotide biases in a DNA sequence are assessed through the odds ratios $\rho_{XY} = f_{XY}/(f_X f_Y)$, where $f_{XY}$ is the frequency of the dinucleotide XY and $f_X$ is the frequency of the nucleotide X. For double-stranded DNA sequences, a symmetrized version $\{\rho^*_{XY}\}$ is computed from corresponding frequencies of the sequence concatenated with its inverted complementary sequence. Our recent studies have demonstrated that the dinucleotide bias profiles $\{\rho^*_{XY}\}$ evaluated for multiple disjoint 50 kb DNA contigs from the same organism are approximately constant throughout the genome and are generally more similar to each other than they are to those from 50 kb contigs of different organisms (Russell and Subak-Sharpe, 1977; Karlin and Cardon, 1994; Blaisdell et al., 1996; Karlin, 1998). This bias pervades both coding and noncoding DNA (Karlin and Mrázek, 1996). Furthermore, related organisms generally have more similar dinucleotide biases than do distantly related organisms.

These highly stable DNA doublets suggest that there may be genome-wide factors that limit the compositional and structural patterns of a genomic sequence. In the absence of strong current selection, the dinucleotide compositions should be especially conservative and likely to drift only slowly with time. Dinucleotide relative abundances capture most of the departure from randomness in genome sequences. Overall, the dinucleotide, trinucleotide, and tetranucleotide relative abundances in a genome are highly correlated, indicating that DNA conformational arrangements are principally determined by base-step configurations. On this basis we refer to the profile $\{\rho^*_{XY}\}$ of a given genome as its “genomic signature.”

What causes the uniformity of genomic signatures throughout the genome of an organism? A reasonable explanation for this constancy of the genomic signature may be based on the replication and repair machinery of different species, which either preferentially generates or preferentially selects specific dinucleotides in the DNA (Karlin and Burge, 1995; Karlin, 1998). These effects might operate through context-dependent mutation rates and/or DNA modifications and through local DNA structures (base-step conformational tendencies).
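The symmetrized signature defined above can be computed along the following lines. In this sketch the two strands are counted separately rather than literally concatenated, so that no artificial dinucleotide is created at the junction; that detail, the restriction to the letters A, C, G, T, and the function name are assumptions.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def symmetrized_signature(seq):
    """Return {XY: rho*_XY} for all 16 dinucleotides.

    Frequencies are pooled over the sequence and its reverse complement.
    Assumes the sequence contains only A, C, G, T and has length >= 2.
    """
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two bases")
    strands = (seq, seq.translate(COMPLEMENT)[::-1])
    mono, di = Counter(), Counter()
    for strand in strands:
        mono.update(strand)
        di.update(strand[i:i + 2] for i in range(len(strand) - 1))
    n_mono, n_di = sum(mono.values()), sum(di.values())
    signature = {}
    for x, y in product("ACGT", repeat=2):
        fx, fy = mono[x] / n_mono, mono[y] / n_mono
        fxy = di[x + y] / n_di
        signature[x + y] = fxy / (fx * fy) if fx and fy else 0.0
    return signature


sig = symmetrized_signature("AACG")
```

By construction each dinucleotide receives the same value as its reverse complement (for example AC and GT), which is why a symmetrized profile has only ten distinct entries.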

2.5.1. Dinucleotide relative abundance extremes in different prokaryotic genomes. Table III presents the $\{\rho^*_{XY}\}$ profiles for a broad collection of complete prokaryotic genomes. The dinucleotide relative abundance of XY is considered underrepresented when $\rho^*_{XY} \le 0.78$ and overrepresented when $\rho^*_{XY} \ge 1.23$ (cf. Karlin, 1998). We observed that the dinucleotide TA is low in about 75% of all prokaryote genomes (only 14/53 genomes in Table III are in the normal range; the others are low). Possible reasons for TA underrepresentation may relate to the low thermodynamic stacking energy of TA, which is the lowest among all dinucleotides, to the high degree of degradation of UA dinucleotides by ribonucleases in mRNA (Beutler et al., 1989), and/or to the presence of TA as part of special regulatory signals. In this context, TA underrepresentation may help to avoid inappropriate binding of regulatory factors.

The dinucleotide relative abundance of CG is low in 20/53 genomes, high in 5/53, and normal in 28/53 (Table III). For example, CG is low or underrepresented in all Chlamydia species, whereas CG is high or overrepresented in α- and β-proteobacterial genomes (except CAUCR) and mostly normal in γ-proteobacteria. CG is low in 7/11 archaea, with HALSP high. The dinucleotide GC is high in many γ- and ε-proteobacterial genomes and also in the low G+C Gram-positive LISMO, STAAU, and CLOAC genomes. Underrepresentation of the dinucleotide CG is prominent in vertebrate sequences but not observed in most invertebrates (e.g., in insects and worms); CG is also significantly low in many protist genomes (e.g., Entamoeba histolytica, Dictyostelium discoideum, and Plasmodium falciparum, but normal in Trypanosoma sp. and Tetrahymena sp.) and is also low in animal mitochondrial genomes (Cardon et al., 1994) and in most small eukaryotic viral genomes (Karlin et al., 1994). CG is underrepresented in MYCGE but not in MYCPN and tends to be suppressed in low G+C Gram-positive genomes; CG is low in BORBU and in many thermophilic prokaryotes, including METJA, METTH, SULSO, PYRHO, and Thermus sp., but not in the thermophiles PYRAE, AQUAE, or THEMA. Connections have been drawn between CG representations and immune system stimulation; see Krieg (1996) and Krieg et al. (1998). It is clear that mechanisms underlying biased CpG representations vary.
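The thresholds just stated can be applied mechanically. The sketch below classifies the archaeal CG values transcribed from Table III; the function name `classify` and the label strings are illustrative choices, not part of the original analysis.

```python
def classify(rho, low=0.78, high=1.23):
    """Label a symmetrized relative abundance using the thresholds in the text."""
    if rho <= low:
        return "under"
    if rho >= high:
        return "over"
    return "normal"


# CG relative abundances of the 11 archaeal genomes, as listed in Table III.
ARCHAEAL_CG = {
    "PYRAB": 0.71, "PYRHO": 0.61, "ARCFU": 0.78, "THEAC": 0.91,
    "THEVO": 0.83, "METTH": 0.51, "METJA": 0.32, "HALSP": 1.36,
    "AERPE": 0.70, "PYRAE": 0.97, "SULSO": 0.67,
}

labels = {genome: classify(value) for genome, value in ARCHAEAL_CG.items()}
n_low = sum(label == "under" for label in labels.values())
```

With these values, seven of the eleven archaeal genomes fall in the underrepresented class and HALSP alone is overrepresented, in line with the statement that CG is low in 7/11 archaea with HALSP high.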

TABLE III
Symmetrized Genome Signatures ($\rho^*$ Dinucleotide Relative Abundance Values)

Genome    CG     GC     TA     AT     CC/GG  AA/TT  AC/GT  CA/TG  AG/CT  GA/TC

HUMAN     0.26   1.01   0.73   0.87   1.25   1.13   0.82   1.20   1.18   0.98
DROME     0.93   1.28   0.74   0.97   1.08   1.24   0.84   1.12   0.87   0.90
CAEEL     0.98   1.04   0.61   0.86   1.05   1.28   0.84   1.08   0.90   1.11
YEAST     0.80   1.02   0.77   0.94   1.06   1.14   0.89   1.10   0.99   1.06
ARATH     0.72   0.92   0.75   0.91   1.05   1.13   0.91   1.10   1.03   1.11

PYRAB     0.71   0.89   0.89   0.90   1.22   1.12   0.77   0.85   1.21   1.15
PYRHO     0.61   0.89   0.90   0.92   1.30   1.11   0.73   0.85   1.22   1.13
ARCFU     0.78   1.02   0.61   0.86   1.04   1.21   0.77   1.01   1.17   1.19
THEAC     0.91   1.04   0.82   1.26   1.05   0.97   0.75   1.06   0.97   1.17
THEVO     0.83   1.12   0.93   1.07   1.12   1.04   0.78   0.99   1.05   1.06
METTH     0.51   0.76   0.74   1.13   1.25   0.95   0.85   1.17   1.07   1.14
METJA     0.32   1.13   0.83   0.94   1.38   1.14   0.72   1.03   1.11   1.05
HALSP     1.36   0.94   0.54   0.98   0.78   0.92   1.26   0.92   0.79   1.33
AERPE     0.70   0.96   1.21   0.95   1.14   0.88   0.83   0.90   1.31   1.03
PYRAE     0.97   1.15   1.08   0.93   1.10   1.18   0.83   0.85   1.07   0.91
SULSO     0.67   0.95   1.00   0.95   1.24   1.04   0.85   0.88   1.17   1.05

ECOLI     1.16   1.28   0.75   1.10   0.91   1.21   0.88   1.12   0.82   0.92
HAEIN     1.09   1.43   0.75   0.95   1.01   1.25   0.85   1.12   0.82   0.87
PSEAE     1.10   1.17   0.54   1.17   0.84   1.07   0.86   1.10   1.02   1.10
BUCSP     0.87   1.25   0.85   0.95   1.22   1.14   0.81   1.02   0.94   1.02
VIBCH     1.04   1.30   0.69   0.99   0.88   1.20   0.89   1.17   0.90   0.95
PASMU     1.07   1.30   0.76   0.97   1.01   1.23   0.90   1.14   0.81   0.89
XYLFA     1.01   1.18   0.68   1.13   0.91   1.10   0.96   1.26   0.83   0.94
NEIME     1.31   1.28   0.64   1.04   0.97   1.45   0.84   1.01   0.69   0.90
AGRTU     1.24   1.21   0.47   1.37   0.86   1.26   0.75   1.04   0.82   1.14
AGRTU-L   1.22   1.20   0.48   1.39   0.87   1.24   0.76   1.05   0.82   1.14
CAUCR     1.16   1.11   0.45   1.29   0.85   1.09   0.86   1.01   0.96   1.22
MESLO     1.23   1.21   0.44   1.41   0.81   1.17   0.79   1.05   0.87   1.18
SINME     1.29   1.15   0.47   1.39   0.82   1.18   0.77   0.93   0.89   1.28
SINMEpA   1.23   1.15   0.52   1.28   0.84   1.15   0.81   1.00   0.91   1.22
RICPR     0.77   1.53   0.98   0.98   1.03   1.05   0.86   1.02   1.06   0.91
HELPY     0.93   1.56   0.73   0.86   1.17   1.37   0.67   0.97   0.97   0.87
CAMJE     0.62   1.75   0.77   0.83   1.11   1.25   0.71   1.03   1.09   0.92
CHLTR     0.79   1.12   0.77   0.89   1.01   1.16   0.76   0.96   1.18   1.15
CHLPN     0.73   1.06   0.78   0.88   1.05   1.14   0.77   0.96   1.19   1.15
CHLMU     0.75   1.12   0.75   0.88   1.08   1.19   0.74   0.96   1.15   1.12
BORBU     0.48   1.47   0.77   0.88   1.29   1.22   0.69   1.02   1.07   1.01
TREPA     1.08   1.22   0.74   0.93   0.86   1.18   0.96   1.13   0.94   0.95
DEIRA     1.07   1.16   0.49   0.89   0.87   1.25   0.93   1.12   1.00   1.01
THEMA     0.92   0.69   0.50   0.83   0.99   1.19   0.87   0.97   1.11   1.40
AQUAE     0.87   0.75   0.82   0.66   1.24   1.29   0.89   0.74   1.18   1.12
MYCTU     1.18   1.07   0.57   1.25   0.87   1.06   1.05   1.11   0.79   1.08
MYCLE     1.12   1.07   0.75   1.10   0.88   1.04   1.05   1.14   0.86   1.02
BACSU     1.04   1.27   0.65   1.02   0.97   1.24   0.75   1.08   0.91   1.06
BACHA     1.09   1.08   0.69   0.98   1.00   1.21   0.84   1.03   0.91   1.10
LACLA     0.77   1.19   0.67   0.88   1.05   1.23   0.82   1.13   0.97   1.05
STAAU     0.94   1.25   0.85   1.00   0.95   1.09   0.95   1.18   0.88   0.95
STRPN     0.69   1.05   0.72   0.89   1.03   1.15   0.85   1.10   1.08   1.09
STRPY     0.71   1.19   0.77   0.89   1.04   1.17   0.87   1.12   1.04   0.99
LISMO     1.11   1.28   0.77   0.92   0.99   1.23   0.86   1.04   0.89   0.97
CLOAC     0.45   1.28   0.93   0.95   1.22   1.08   0.81   1.02   1.13   0.97
UREUR     0.88   1.42   0.79   0.93   1.08   1.17   0.89   1.18   0.84   0.94
MYCGE     0.39   1.19   0.75   0.77   1.13   1.23   0.96   1.16   1.06   0.89
MYCPN     0.82   1.14   0.77   0.71   1.12   1.30   1.02   1.08   0.96   0.81
SYNY3     0.75   1.02   0.75   1.00   1.36   1.32   0.79   1.05   0.85   0.86

In the published table, significantly underrepresented dinucleotides ($\rho^* \le 0.78$) are shown in red (bold face if $\rho^* \le 0.70$), and significantly overrepresented dinucleotides ($\rho^* \ge 1.23$) are shown in green (bold face if $\rho^* \ge 1.30$).

2.5.2. Genome signature differences among sequences. A measure of the genomic difference between two sequences f and g (from different organisms or from different regions of the same genome) is the dinucleotide average absolute relative abundance difference, calculated as

$$\delta^*(f, g) = \frac{1}{16} \sum_{XY} \left| \rho^*_{XY}(f) - \rho^*_{XY}(g) \right|,$$

where the sum extends over all dinucleotides; usually $\delta^*$ is averaged between all pairs of 50 kb contigs of the sequences composing f and g. For convenience, we describe levels of $\delta^*$-differences for some examples (all values multiplied by 1000): “close” indicates $\delta^* \le 45$ (e.g., human vs cow, E. coli vs S. typhimurium), “moderately similar” indicates $55 \le \delta^* \le 85$ (e.g., human vs chicken, E. coli vs H. influenzae), “weakly similar” indicates $90 \le \delta^* \le 120$ (e.g., human vs sea urchin, M. genitalium vs M. pneumoniae), “distantly similar” indicates $125 \le \delta^* \le 145$ (e.g., human vs Sulfolobus, M. jannaschii vs M. thermoautotrophicum), “distant” indicates $150 \le \delta^* \le 180$ (e.g., human vs Drosophila, E. coli vs H. pylori), and “very distant” indicates $\delta^* \ge 190$ (e.g., human vs E. coli, M. jannaschii vs Halobacterium sp.).

How do within-species average $\delta^*$-differences compare with between-species ones? Average within-prokaryotic group $\delta^*$-differences range from 12 to 52 (persistently small), whereas the average between-group $\delta^*$-differences range from 26 to 357 (see Fig. 1).

What are the possible mechanisms underlying the signature determinations? DNA participates in multiple activities including genome replication, repair, and segregation, and provides special sequences for encoding gene products. There are fundamental differences in replication characteristics between Drosophila and mouse (Blumenthal et al., 1974). Drosophila DNA replicates frenetically in the first hour after fertilization, with replication bubbles distributed about every 10 kb. At about 12 h, effective origins are spread to about 50 kb apart. In mouse, the rate of replication appears to be uniform throughout developmental and adult stages. Cell divisions involve DNA stacking on itself and loopouts that need to be judiciously decondensed to undergo segregation. The observed narrow limits to intragenomic heterogeneity putatively correlate with conserved features of DNA structure.

The influence of the (double-stranded dinucleotide) base step on DNA conformational preferences is reflected in slide, roll, propeller twist, and helical twist parameters (Calladine and Drew, 1992; Hunter, 1993). Calculations and experiments both indicate that the sugar–phosphate backbones are relatively flexible. However, base sequence influences flexural properties of DNA and governs its ability to wrap around histone cores. Moreover, certain base sequences are associated with intrinsic curvature, which can lead to bending and supercoiling. Inappropriate juxtaposition or distribution of purine and pyrimidine bases could engender steric clashes (Calladine and Drew, 1992; Hunter, 1993). For example, transient misalignment during replication is associated with structural alterations of the backbone in alternating purine–pyrimidine sequences. On the other hand, purine and pyrimidine tracts have fewer steric conflicts between neighbors (Kunkel, 1992). Dinucleotide relative abundance deviations putatively reflect duplex curvature, supercoiling, and other higher-order DNA structural features. Many DNA repair enzymes recognize shapes or lesions in DNA structures more than specific sequences (Echols and Goodman, 1991). DNA structures may be crucial in modulating processes of replication and repair. Nucleosome positioning, interactions with DNA-binding proteins, and ribosomal binding of mRNA appear to be strongly affected by dinucleotide arrangements (Calladine and Drew, 1992).

A central unresolved problem concerns whether archaea are monophyletic or polyphyletic. On the basis of rRNA gene comparisons, the archaea are deemed monophyletic (Woese et al., 1990). This conclusion is supported by some protein comparisons, e.g., for the archaeal RecA-like sequences of Rad 50/Dmc1/RadA (Sandler et al., 1996) and the elongation factor EF-1α and EF-G families (Creti et al., 1994). However, many protein comparisons challenge the monophyletic character of the archaea. For example, bacterial relationships based on comparisons among the HSP 70 kD sequences place the Halobacteria closer to the Streptomyces than to archaeal or eukaryotic species (Karlin and Cardon, 1994). Other examples are glutamate dehydrogenase and glutamine synthetase (Benachenhou-Lahfa et al., 1994; Brown et al., 1994). Lake and collaborators split the prokaryotes into eubacteria, euryarchaea, and eocytes. With respect to genomic signature comparisons, the closest to HALSP are Streptomyces sequences ($\delta^*$-differences about 85), and next, but about twice as distant, are M. tuberculosis and M. leprae sequences: $\delta^*$(HALSP, MYCTU) = 145; $\delta^*$(HALSP, MYCLE) = 155. The $\delta^*$-differences of HALSP to the archaeal sequences of Sulfolobus sp. and M. jannaschii are very distant: $\delta^*$ > 280 and > 340, respectively. Sulfolobus sp. sequences are moderately similar only to Clostridium acetobutylicum, $\delta^*$ = 87. All other comparisons with Sulfolobus sp. have $\delta^*$ values > 120 and mostly > 180. $\delta^*$-differences of HALSP from other archaea exceed 245. Thus, a coherent description for the archaea is not supported by $\delta^*$-difference data.

Should rickettsial sequences be grouped with α-proteobacteria? The classical α-types consist of two major subgroups: A1 including Rhizobium sp. and Agrobacterium tumefaciens, and A2 including R. capsulatus and R. sphaeroides, found predominantly in soil and/or marine habitats, the latter capable of anoxygenic photosynthesis. A third tentative group A3 includes the Rickettsial and Ehrlichial clades (obligate intracellular parasites). The genome signature comparisons indicate pronounced discrepancies between A1 and A2 vs A3. First, the A1 and A2 genomes are pervasively of high G+C content (generally ≥ 60%), whereas A3 genomes are of low G+C content (≤ 35%). Second, the $\delta^*$-differences among A1 sequences are 35–63 and among the A2 sequences are 65–90. The $\delta^*$-differences between A1 and A2 sequences have the range 74–102. By contrast, the RICPR genomes compared to A1 and A2 show excessive $\delta^*$-differences, generally > 200.

FIG. 1.

Some additional challenging observations based on signature differences are: (i) All Chlamydia genomes are close in genome signature to the ARCFU genome. (ii) The enterobacteria ECOLI, HAEIN, VIBCH, and PASMU are intriguingly moderately similar to the Drosophila genome. (iii) In terms of $\delta^*$-differences, the three cyanobacterial DNA sequences are not close. The cyanobacteria Synechocystis, Synechococcus, and Anabaena do not form a coherent group and are as far from each other as general Gram-negative sequences are from Gram-positive sequences. (iv) There is no consistent pattern of $\delta^*$-differences among thermophiles. More generally, grouping of prokaryotes by environmental criteria (e.g., habitat properties, osmolarity tolerance, chemical conditions) reveals no correlations in genomic signature.
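The signature difference defined in Section 2.5.2 can be sketched directly from ten-column symmetrized profiles such as those in Table III. Because each of the six paired columns (CC/GG, AA/TT, and so on) stands for two dinucleotides with equal values, it is counted twice in the sum over 16 dinucleotides. The sketch uses whole-genome values rather than averages over pairs of 50 kb contigs, so its numbers only approximate those quoted in the text; the function names and the handling of values that fall between the stated ranges are assumptions.

```python
SELF_COMPLEMENTARY = ("CG", "GC", "TA", "AT")
PAIRED = ("CC/GG", "AA/TT", "AC/GT", "CA/TG", "AG/CT", "GA/TC")


def delta_star(sig_f, sig_g):
    """Average absolute difference of rho* over all 16 dinucleotides."""
    total = sum(abs(sig_f[k] - sig_g[k]) for k in SELF_COMPLEMENTARY)
    total += 2 * sum(abs(sig_f[k] - sig_g[k]) for k in PAIRED)
    return total / 16


def similarity_level(delta_x1000):
    """Map 1000 * delta* to the verbal levels in the text (None inside the gaps)."""
    if delta_x1000 <= 45:
        return "close"
    if 55 <= delta_x1000 <= 85:
        return "moderately similar"
    if 90 <= delta_x1000 <= 120:
        return "weakly similar"
    if 125 <= delta_x1000 <= 145:
        return "distantly similar"
    if 150 <= delta_x1000 <= 180:
        return "distant"
    if delta_x1000 >= 190:
        return "very distant"
    return None


# Whole-genome signatures transcribed from Table III.
COLUMNS = SELF_COMPLEMENTARY + PAIRED
HUMAN = dict(zip(COLUMNS, (0.26, 1.01, 0.73, 0.87, 1.25, 1.13, 0.82, 1.20, 1.18, 0.98)))
ECOLI = dict(zip(COLUMNS, (1.16, 1.28, 0.75, 1.10, 0.91, 1.21, 0.88, 1.12, 0.82, 0.92)))

human_vs_ecoli = 1000 * delta_star(HUMAN, ECOLI)
level = similarity_level(human_vs_ecoli)
```

For this pair the value comes out near 211, which falls in the "very distant" class, consistent with the human vs E. coli example given above.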

2.5.3. Genome signature comparisons among eukaryotes. (i) The most homogeneous genomes occur among fungi, especially for S. cerevisiae, whereas the most variable genomes are found among protists. The distribution of the δ*-differences between human and mouse sequence samples is only slightly shifted relative to δ*-differences within human sequence samples, reflecting moderate similarity of human and mouse. On the other hand, the δ*-differences between human and S. cerevisiae and between human and D. melanogaster are substantially higher than all within-species δ*-differences.

3. PROTEOMIC FEATURES

3.1. Chaperone Protein Contrasts

Molecular chaperone systems have evolved in all three domains of life or originated in a common ancestor. Chaperones are considered to play pivotal roles in protein folding, degradation of misfolded proteins, proteolysis, and translocation across membranes. Specialized complex structures in cells often need their own "dedicated" chaperones (e.g., Kuehn et al., 1993). Among the top PHX genes in most bacterial genomes are those for the major chaperone proteins, DnaK (HSP70) and GroEL (HSP60). The HSP60 chaperonin complex is considered to assist protein folding by providing a cavity in which non-native polypeptides are enclosed and thereby protected against intermolecular aggregation (for a review, see Hartl and Hayer-Hartl, 2002). The ATP-regulated DnaK, together with its cofactors DnaJ and GrpE, and the ATP-independent trigger factor (Tig) are reported to act co-translationally to assist in protein folding. Tig and DnaK are proposed to cooperate in the folding of newly synthesized proteins (Teter et al., 1999). Simultaneous deletion of both Tig and DnaK is lethal under normal growth conditions (Deuerling et al., 1999). Tig is broadly PHX for bacterial genomes but is not found in archaeal or eukaryotic genomes. HSP70 is abundant in many eukaryota and bacteria, often with multiple copies of the gene, but the gene is not present at all in some archaea. In particular, the HSP70 gene is absent from METJA, ARCFU, PYRAB, PYRHO, PYRAE, SULSO, and AERPE but present in the archaea METTH, THEAC, and HALSP, where it may have been acquired by lateral transfer. All archaea and eukaryota apparently contain the molecular chaperone prefoldin or GimC (genes involved in microtubule biogenesis), which is absent from bacteria. GimC is believed to perform HSP70-like functions although there is no sequence similarity between GimC and HSP70 (Siegert et al., 2000).

The chaperonin and its co-chaperonin (GroEL/GroES) are seen to be highly expressed in virtually all bacterial genomes (see Table IV) (Houry et al., 1999), but both are absent from the mycoplasma UREUR. GroES is not found in archaea. The archaeal thermosome (a distant homolog of GroEL) is highly expressed in archaea (Karlin and Mrázek, 2000) and is a more closely related homolog of the eukaryotic protein TCP1, now referred to as TriC or CCT. The HSP60s in all three domains are purified from cells as double-ring complexes. In bacteria, each ring of GroEL contains seven HSP60 subunits, while in archaea, each ring contains eight or nine HSP60 subunits. The subunits comprising the ring may be identical or may be up to eight different, but closely related, HSP60s. In most bacteria, the subunits are identical, except for Rhizobium α-proteobacteria where there are two subunits. In some archaea rings are formed from identical subunits, while in others there are two subunits, and so far only among the Sulfolobus sp. are there three subunits. Yeast contains at least 11 distinctive TriC genes. It is believed that the eukaryotic ring structures contain six to nine different subunits, but it is not yet clear how the different protein subunits are arranged.

Duplicated HSP60 sequences stand out among the classical α-proteobacteria, in contrast to no duplications of HSP60 in all other proteobacterial clades. Multiple HSP60 sequences also exist in cyanobacteria, in Chlamydia, and in high G+C Gram-positive bacteria, but not in RICPR. Many amitochondrial eukaryotes (TRIVA, GIALA, ENTHI) contain two or more HSP60. Plastids carry multiple copies of HSP60 that bind Rubisco.

The bacterial thioredoxin (TrxA) assists protein folding by catalyzing the formation or disruption of disulfide bonds (Powis and Montfort, 2001; Ritz and Beckwith, 2001). The eukaryotic thioredoxin functional analog is protein disulfide isomerase, operating in the endoplasmic reticulum. The highest expression levels for thioredoxin occur in BACSU and then in other fast-growing bacteria in the order DEIRA, VIBCH, HAEIN, and ECOLI (data not shown). Other chaperones are also distributed unevenly through the three domains. HSP90 exists in many bacteria and eukaryotes but is not found in archaea.

Peptidyl-prolyl cis–trans isomerases (PPIases) accelerate the proper folding of proteins by promoting the cis–trans isomerization of imide bonds in proline within oligopeptides. Tig exhibits PPIase activity in vitro. ECOLI has at least nine PPIases defined by sequence similarity. One of these, the survival protein SurA, enhances the folding of periplasmic and outer membrane proteins. As expected, SurA does not exist in Gram-positive bacteria. DegP is a chaperone folding factor that is significantly PHX and acts primarily in degrading misfolded proteins in the periplasm. Also associated with periplasmic and cytoplasmic chaperones are several PPIases and the disulfide oxidase DsbA.

GroEL (and thermosomes in archaea) is PHX (expression level E(g) ≥ 1) in almost all prokaryotic genomes, as displayed in Table IV. Many HSP60 family members are among the top PHX genes, with E(g) values often exceeding 2.00.

3.2. Ribosomal Protein (RP) Gene Distribution

Many RP genes have diverged between most archaeal and bacterial genomes (see COGs database at NCBI-Bethesda, Tatusov et al., 2001). Most bacterial genomes have an operon or cluster, accounting for 20–40% of all RP genes, located close to their origin of replication.

TABLE IV

Chaperonin (HSP60) Expression Levels

Proteobacteria

γ-clade: ECOLI, E(g) = 2.09; SALTY, 2.35; VIBCH, 1.31; HAEIN, 1.47; PSEAE, 1.37; YERPE, 2.08; BUCSP, 1.13.

β-clade: NEIME, 1.15.

α-clade: AGRTU, 2.07; SINME (5 copies), 1.66, 0.60, 1.67, 0.66, 1.13; MESLO (5 copies), 1.51, 1.96, 1.87, 1.73, 2.03; CAUCR, 1.78; RICPR, 1.01.

ε-clade: HELPY, 1.17; CAMJE, 1.43.

Only α-types among proteobacteria carry multiple copies of GroEL and many are very highly expressed.

Other Gram-negative: DEIRA, 2.35; cyanobacteria SYNY3 (2 paralogs), 1.51, 1.47; Nostoc sp., 1.59, 1.38.

Small pathogenic: CHLTR (2 copies), 0.87, 1.16; CHLPN (3), 0.89, 0.73, 1.18; TREPA, 1.27; BORBU, 1.13; MYCGE, 0.82.

UREUR has no GroEL.

Gram-positive

Low Gram+: LISMO, 1.89; LACLA, 1.23; STRPY, 0.86; STRPN, 1.08; BACSU, 1.87; BACHA, 1.79.

High Gram+: MYCTU (2 copies), 1.39, 0.95; MYCLE (2), 1.60, 1.30.

Thermosomes in archaea are generally structured from two rings of α and β units. All units are PHX.

HALSP, β 1.20, α 1.21; ARCFU, β 1.35, α 1.49; METJA (a single unit), 1.56; METTH, 1.33, 1.38; THEAC, α 1.07, β 1.18; PYRHO (only 1 unit), 1.27; PYRAB (only 1 unit), 1.40; PYRAE, 1.48, 1.63; AERPE, 1.14, 1.21; SULSO (3 units α, β, γ), 1.25, 1.37, 1.03.

E(g) refers to the formal predicted expression level (Karlin and Mrázek, 2000).
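As a small illustration of how the Table IV entries can be read, the sketch below selects genomes whose GroEL expression value meets a chosen cutoff. The values are the γ-clade E(g) entries copied from Table IV; the helper name and the idea of passing the cutoff as a parameter are assumptions made for this example only.

```python
# E(g) values for GroEL in the gamma-clade proteobacteria, as listed in Table IV.
GROEL_EG = {
    "ECOLI": 2.09,
    "SALTY": 2.35,
    "VIBCH": 1.31,
    "HAEIN": 1.47,
    "PSEAE": 1.37,
    "YERPE": 2.08,
    "BUCSP": 1.13,
}


def genomes_at_or_above(levels, cutoff):
    """Return genome codes, alphabetically, whose E(g) is >= cutoff."""
    return sorted(code for code, eg in levels.items() if eg >= cutoff)


# Genomes whose GroEL exceeds the "often exceeding 2.00" mark used in the text.
print(genomes_at_or_above(GROEL_EG, 2.00))
```

With a cutoff of 1.0 every γ-clade genome in the table is returned, consistent with GroEL being PHX throughout this clade.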


Many genes involved in protein synthesis, including tuf, fus, rpoA, rpoB, and rpoC, are encoded within or proximal to the large RP cluster in bacteria but not in archaea. Archaeal genomes, apparently lacking a unique origin of replication, contain approximately the same numbers of RP genes (range 50–65) as bacterial genomes and possess a less pronounced RP operon than bacterial genomes. In contrast, the RP genes of yeast (and of higher eukaryotes) are randomly dispersed throughout the genome.

A "giant" RP gene (labeled S1), commonly exceeding 500 aa length, is essential in bacterial genomes (with the exception of Mycoplasma), where it is encoded away from the RP cluster, but is missing from archaeal and eukaryote genomes. S1 is overall acidic and binds weakly and reversibly to the small subunit of the ribosome, whereas most other RPs bind strongly (Sengupta et al., 2001). S1 has a high affinity for interaction with mRNA chains. Protein S1 is the largest RP present in the ribosome of most bacterial cells and consists of multiple tandem structural motifs, each of about 70 aa length. The S1 protein is reported to be necessary in many cases for translation initiation and translation elongation and is directly involved in the process of mRNA recognition and binding. S1 can facilitate binding of mRNA that lacks a strong Shine–Dalgarno sequence. S1 is not encoded near any RP operon and generally is found among the highest expression levels. S1 is also encoded by the deeply branching Gram-negative AQUAE and THEMA, the latter allowing for a frame shift. The 820 aa S1 protein of THEMA can be recognized as a fusion of cytidylate kinase (contributing to pyrimidine biosynthesis) with a standard S1. The S1 proteins of Bacillus genomes (BACSU, BACHA) and of most low G+C Gram-positive genomes are generally of reduced size (in the range 380–407 aa).

Bacterial and archaeal genomes encode about 50–65 RPs (to date, the highest number among prokaryotic genomes is 65 in SULSO), whereas metazoan eukaryotes invariably have 78 or 79 RP components (Warner, 1999). The situation for protozoa may be different. Ribosomal proteins are generally cationic (mostly > 20% cationic residues). Three acidic RPs are found in eukaryotes, P0, P1, and P2, each containing a carboxyl hyperacidic residue run. Of these, only P0 is present in archaeal genomes. P0, P1, and P2 are considered to play an important regulatory role in the initiation step of eukaryotic protein translation. Acidic RPs are not present in bacterial genomes, except for S1 and L7/L12. L7/L12, as with P0, P1, and P2, is thought to act in adapting mRNA chains to the ribosome.

3.3. Special Transcription/Translation/Replication Factors

The special eukaryotic ancillary replication protein PCNA is extant in most archaea and eukaryotes but is not found in bacteria. Actually, there are multiple copies of the PCNA gene in the crenarchaeal genomes of SULSO, PYRAE, and AERPE. Translation elongation factors (e.g., Tuf, Fus) occur as single genes in archaea but generally appear in multiple highly expressed genes in α-, β-, and γ-proteobacteria. The ribosome release elongation factor Rrf is found in most bacteria and in yeast, but is missing from archaea. The helicase protein RecG, which helps facilitate branch migration of the Holliday junction, is widespread in bacteria but seemingly absent from archaea (Suyama and Bork, 2001).

3.4. Origin and Function of Membrane Lipids

All three domains contain polyisoprenes, but eukaryotes use significant amounts of sterols not present in either bacteria or archaea. Membranes of Gram-negative bacteria and eukaryotes are replete with phospholipid and lipid-modified proteins, whereas archaea generally emphasize prenylated ether lipids but apparently make no fatty acids (Hayes, 2000).

Lipopolysaccharide biosynthesis genes of anomalous codon usage, which encode a hierarchy of surface antigen proteins (the Lps family) and often occur in clusters, are present in many bacterial and in archaeal genomes but never in Gram-positive organisms and apparently are not present in eukaryotes. The Lps biosynthesis genes generally indicate a putative alien gene cluster as characterized in Karlin (2001). The lipid-A anchor (connecting sugar and lipid moieties) prominent in ECOLI and SALTY appears to be missing from Gram-positive and archaeal genomes. This phenomenon may be related to the fact that the enzymatic apparatus for lipid synthesis appears to be much reduced or nonexistent in most archaeal genomes. For example, FabB, FabD, and AcpP are not found in the archaea.

3.5. Nitrogen Fixation (Nif)

Nif genes are present in several bacteria and archaea but apparently not in eukaryotes. The glnB family of nitrogen sensory–regulatory genes is widespread in bacteria and archaea. Nif in archaea is evolutionarily related to Nif genes in bacteria and operates by the same fundamental mechanism (Leigh, 2000). It is proposed that some genes of this kind wander about via lateral transfer (e.g., as occurs in Klebsiella). Some Nif genes are found in AQUAE, ARCFU, CHLTR, CHLPN, DEIRA, HAEIN, HELPY, METTH, NEIME, SYNY3, TREPA, VIBCH, and SINME. For example, the predominant nitrogenases in methanogens seem to be molybdenum (cofactor) nitrogenases, as is the case in bacteria. The methanogens vary with respect to nitrogen fixation. For example, neither METJA nor METVO fixes nitrogen, while Methanosarcina barkeri and Methanococcus thermolithotropicus both do (Leigh, 2000).

4. METABOLIC PATHWAYS AND SOME PROTEIN CLASSES

We describe in Tables V–VIII the status of genes of several pathways among archaeal and bacterial species, emphasizing the presence, absence, and expression levels of these genes. (E(g) signifies the formal predicted expression level; see Karlin and Mrázek (2000).)

4.1. Glycolysis (Table V)

Hexokinase (Hxk) and glucokinase (Glk) are prominent glycolysis proteins in eukaryotes, but the former is not found in most prokaryotes nor in any archaeal sequences to date. Only TREPA contains a hexokinase homolog of low expression level. In glycolysis, hexokinase converts glucose to glucose-6-phosphate. However, glucose-6-phosphate arises from other hexoses and from glucose transported into the cell via the phosphotransferase system (PTS). Glucokinase occurs in many bacteria but normally at low to moderate expression levels. Bacteria which rely on carbohydrates as a primary energy source (STRPY, LACLA, BACSU, ECOLI, VIBCH) use the PTS system to transport glucose into the cytoplasm, which concomitantly phosphorylates glucose, making Hxk/Glk expendable. PTS genes are present but generally not PHX in PSEAE, HAEIN, NEIME, MESLO, CAUCR, CHLTR, CHLPN, TREPA, and MYCGE. PTS genes are absent from other current bacteria, from all current archaea, and from yeast. Bacteria mostly execute complete glycolysis cycles (apart from Hxk/Glk), and glycolysis enzymes tend to be PHX, with very high expression levels prevailing in yeast, LACLA, STRPY, and ECOLI.

The human obligate intracellular parasite RICPR is not able to metabolize glucose (Winkler and Daugherty, 1986). However, RICPR contains five ATP–ADP exchange translocase genes. These antiporters take ATP from host cytoplasmic sources and release ADP from the bacterial cell; the standard mitochondrial exchange is reversed. The ATP–ADP translocase is very uncommon among bacteria and has been identified only in Chlamydia and Rickettsia and in an assortment of plant plastids.

The glycolysis genes are PHX in all fast-growing bacteria, with high expression levels for most of them (Karlin et al., 2001). Glycolysis genes in archaea are either not PHX or entirely missing. For example, glucose-phosphate isomerase is missing from the archaeon ARCFU, from Pyrococcus species, as well as from bacteria in the Mycoplasmas group. Phosphofructokinase is absent from archaea and several proteobacteria (see Table V). There are two types of fructose bisphosphate aldolase proteins. The class-II Fba is present in all bacteria (except RICPR, HELPY, CAMJE, UREUR, and Chlamydia) and also in yeast but has no homologs in archaea. Chlamydia possess a class-I Fba which also carries homologs in most archaea, in ECOLI, in MESLO, and in AQUAE. Phosphoglycerate mutase (Gpm) is present in all bacteria (except RICPR) and in yeast, but only HALSP and THEAC carry Gpm homologs among archaea. Apart from Rickettsia, the glycolysis enzymes triosephosphate isomerase (Tpi), glyceraldehyde-3-phosphate dehydrogenase (Gap), phosphoglycerate kinase (Pgk), enolase (Eno), and pyruvate kinase (Pyk) are widespread in all prokaryotes. The multifunctional enzyme Gap is missing from UREUR, and pyruvate kinase (Pyk) is missing from ARCFU, METTH, AQUAE, HELPY, and TREPA. Excluding hexokinase (Hxk) and glucokinase (Glk), there are 10 major glycolysis genes (Table V). Bacterial genomes with at least six PHX glycolysis genes include LACLA (nine PHX glycolysis genes), STRPY (9), BACSU (7), SYNY3 (6), LISMO (8), ECOLI (10), VIBCH (8), HAEIN (7), and MESLO (6). No archaeal genome has more than three (mostly one) PHX glycolysis genes.
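The PHX glycolysis counts listed above lend themselves to a quick tabulation. The following sketch ranks the named bacterial genomes by the number of PHX glycolysis genes out of the 10 major genes; the counts are copied from the text, while the variable and function names are illustrative assumptions.

```python
# Number of PHX glycolysis genes per genome, as stated in the text
# (out of the 10 major glycolysis genes, excluding Hxk and Glk).
PHX_GLYCOLYSIS = {
    "LACLA": 9, "STRPY": 9, "BACSU": 7, "SYNY3": 6, "LISMO": 8,
    "ECOLI": 10, "VIBCH": 8, "HAEIN": 7, "MESLO": 6,
}
MAJOR_GLYCOLYSIS_GENES = 10


def rank_by_phx(counts):
    """Sort genomes by descending PHX count, breaking ties alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def fraction_phx(counts, genome):
    """Fraction of the major glycolysis genes that are PHX in one genome."""
    return counts[genome] / MAJOR_GLYCOLYSIS_GENES


for genome, n in rank_by_phx(PHX_GLYCOLYSIS):
    print(f"{genome}: {n}/{MAJOR_GLYCOLYSIS_GENES}")
```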

4.2. Aminoacyl-tRNA Synthetases (Table VI, aaRS)

The picture of aaRS proteins has become rather complex (Handy and Doolittle, 1999; Wolf et al., 1999). The existence of two classes of aaRS proteins has been firmly established during the past decade. Evidence comes from X-ray structural data, sequence comparisons, and enzymatic mechanisms. The corresponding amino acids divide into two sets of residues: Leu, Ile, Met, Val, Tyr, Gln, Glu, Arg, Cys, and Trp constitute Class I, whereas Class II amino acids are Gly, Ala, Ser, Asp, Asn, Lys, His, Pro, Thr, and Phe. Most aaRS are present in all genomes. However, glutaminyl-tRNA synthetase (GlnS) is missing from all archaea and most bacteria. In fact, it is present only in γ-proteobacteria and DEIRA. In other species, GlnS is complemented by GluS and an amidotransferase. Asparaginyl-tRNA synthetase (AsnS) is absent from many prokaryotic genomes. Among archaea, AsnS is found in both Thermoplasma and both Pyrococcus sequences. Among bacteria, AsnS occurs in several γ-proteobacteria, in several low G+C Gram-positive bacteria, in spirochetes, and in mycoplasmas. Glycyl-tRNA synthetase in many bacteria is composed of two subunits, GlyS and GlyQ, whereas the archaea have a single unit, GRS1. Interestingly, DEIRA, mycobacteria, spirochetes, and mycoplasmas also have the archaeal GRS1 instead of the bacterial type. Analogously, for lysyl-tRNA synthetase, the Class II form (LysU) occurs in most bacteria and, notably, in yeast, whereas the Class I form (LysS) is found in all archaea, in α-proteobacteria (including RICPR), and in spirochetes (see the gene definitions accompanying Table VI).

Among bacteria, the number of PHX aaRS varies from zero in DEIRA, PSEAE, and several other species to 19 in ECOLI. Archaea vary widely, with no aaRS reaching PHX status in METJA but 13 PHX in AERPE. In the yeast genome, 13 aaRS are PHX. Most yeast aminoacyl-tRNA synthetases occur in two copies, with a PHX nuclear version and a mitochondrial version of relatively low expression level.
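The two-class division of aaRS described above can be captured as a simple lookup. This is a minimal sketch: the class memberships are copied from the text, and the function name and three-letter keys are illustrative choices. It encodes only the canonical assignment; the lysyl case, where both class I and class II enzymes exist in different lineages, is deliberately not modelled.

```python
# Canonical aaRS class of each amino acid, as listed in the text.
CLASS_I = {"Leu", "Ile", "Met", "Val", "Tyr", "Gln", "Glu", "Arg", "Cys", "Trp"}
CLASS_II = {"Gly", "Ala", "Ser", "Asp", "Asn", "Lys", "His", "Pro", "Thr", "Phe"}


def aars_class(amino_acid):
    """Return "I" or "II" for a three-letter amino acid code."""
    if amino_acid in CLASS_I:
        return "I"
    if amino_acid in CLASS_II:
        return "II"
    raise ValueError(f"unknown amino acid code: {amino_acid!r}")


print(aars_class("Gln"), aars_class("Lys"))
```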

4.3. TCA Cycle Genes (Table VII)

The TCA cycle, apart from production of energy, can contribute in myriad ways to cellular needs, especially in making precursors and intermediates to macromolecules, e.g., in amino acid, vitamin, and heme biosyntheses. The order of actions in the TCA cycle is: citrate synthase (GltA), aconitate hydratase (AcnA/AcnB), isocitrate dehydrogenase (Icd), 2-oxoglutarate dehydrogenase (SucA), succinyl coenzyme A (succinyl-CoA) synthetase (SucD and SucC), succinate dehydrogenase (SdhB, SdhC, and SdhD), fumarate hydratase (FumA or FumC), and malate dehydrogenase (Mdh/CitH).
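The ordered list of TCA cycle steps can be turned into a small presence/absence check. The sketch below records, for each step, the gene sets that could satisfy it and reports which steps a given gene inventory cannot cover. Treating "A/B" and "A or B" as alternatives and "A and B" as jointly required subunits is a modelling assumption of this example, as are all names; the Mdh-only inventory mirrors what the text reports for STRPY.

```python
# Each step: (enzyme name, list of alternative gene sets that satisfy it).
TCA_STEPS = [
    ("citrate synthase", [{"GltA"}]),
    ("aconitate hydratase", [{"AcnA"}, {"AcnB"}]),
    ("isocitrate dehydrogenase", [{"Icd"}]),
    ("2-oxoglutarate dehydrogenase", [{"SucA"}]),
    ("succinyl-CoA synthetase", [{"SucD", "SucC"}]),
    ("succinate dehydrogenase", [{"SdhB", "SdhC", "SdhD"}]),
    ("fumarate hydratase", [{"FumA"}, {"FumC"}]),
    ("malate dehydrogenase", [{"Mdh"}, {"CitH"}]),
]


def missing_steps(present_genes):
    """Return, in cycle order, the steps not covered by the given genes."""
    present = set(present_genes)
    return [
        name
        for name, alternatives in TCA_STEPS
        if not any(required <= present for required in alternatives)
    ]


# A genome carrying only Mdh leaves every other step uncovered.
print(missing_steps({"Mdh"}))
```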

Gene | Archaea: Afu Hbs Mja Mth Tac Pho Pab Ape | Euk: Sce | Bacteria: Aae Tma Dra Mtu Mle Lla Spy Bsu Syn Eco EcZ Pae Vch Hin Nme Hpy Cje Mlo Ccr Rpr Ctr Cpn Tpa Bbu Uur Mge

HXK - - - - - - - -250174 - - - - - - - - - - - - - - - - - - - - - - 84 - - -

Glk - - - - - - - - - - - - - - - - - 87 49 58 - - - 55 92 - 8265 64

- - - - - - -

Pgi - 80 56 83 97 - - 93 260 73 101 66 94 81 213 157 92 79 141 136 60 67 6046 63

99 94 97 77 - 93 75 74 83 - -

PfkA - - - - - - - -213 249 89

97 96

99 105 91 176 143 8910659

186 198 - 101 109 - - - 93 - -76 92

82 102

83 82

119 109

84 104

Fba - - - - - - - - 227 82 101 136 119 99 208 207 19959

12723847 39

23361

etc.103 188 146 86 92 89 106 99 - - - 100 99 93 81

FbaB98 85

119107

88 65

100 - 85 77 87 - 114 - - - - - - - -??

54 47

- - - - - - 74 - - 92 93 - - - -

TpiA 103 107 75 100 88 109 96 88 252 97 91 59 91 83 192 159 148 77 208 192 95 130 130 77 94 7583 77

58 - 96 82 99 105 84 84

GapA 103 113 106 89 102 88 105 84254 233 229

111 119 166 120 112 195 104 213 180

48130 107

20746 ?

21353 49

80 62 50

19342 etc

17280 74

92 85

97 175 154 - 96 105 104 109 - 91

Pgk 79 90 69 82 98 75 88 73 244 80 91 67 79 80 230 218 134 133 238 243 69 177 104 65 94 85 129 146 - 84 88 98 111 72 93

GpmA - - - - - - - -25445 38

- - - 105 113 227 188 - - 124 139 - - 123 80 - - 115 80 - 97 107 102 103 - -

GpmB - - - -95 90 89

- - -77 66 etc

90 72

9567 66 etc

117100 etc

94 93 etc

58 51 etc

66 54

70105 103 87

69 63

63 58

81 66

60 - - - -82 76 etc

66 62

- - - - - - -

GpmI - 99 - - - - - - - - - - - - - - 108 94 ? 185 59 147 - - 86 75 - - - - - - - 70 81

DeoB - - - - - - - - - - 83 68 - - 82 80 61 -16846

182 - 65 - - 80 - 67 - - - - - - - -

Eno 88 9376 63

102 9293 76

87 68

82235 222etc

98 111 147 101 9219649

217 192 112 211 214 91 193 159 93 85 92 142 128 - 79 89 98 81 93 78

PykF - 80 67 - 101 73 137 112 22543

- 8886 61

102 90 225 204 118 10470

25895 etc

26488 etc

68 57 49

17554 49

158 63 - 1021098169

53 - 78 91 - 84 70 108

TABLE V

E(g) Values (Multiplied by 100) for Glycolysis Genes

PHX genes are shown in red, PA in blue. Special symbols: -, the gene is not present in the genome; ?, COG data indicate that the gene is in the genome but the name does not match any gene in the annotation; etc., more than three homologs belong to the COG. Top two E(g) values are shown. Genes: Hxk, hexokinase; Glk, glucokinase; Pgi, glucose-6-phosphate isomerase; PfkA, 6-phosphofructokinase; Fba, fructose/tagatose bisphosphate aldolase; FbaB, DhnA-type fructose-1,6-bisphosphate aldolase and related enzymes; TpiA, triosephosphate isomerase; GapA, glyceraldehyde-3-phosphate dehydrogenase/erythrose-4-phosphate dehydrogenase; Pgk, 3-phosphoglycerate kinase; GpmA, phosphoglycerate mutase 1; GpmB, phosphoglycerate mutase/fructose-2,6-bisphosphatase; GpmI, phosphoglyceromutase; DeoB, phosphopentomutase; Eno, enolase; PykF, pyruvate kinase.


Gene | Archaea: Afu Hbs Mja Mth Tac Pho Pab Ape | Euk: Sce | Bacteria: Aae Tma Dra Mtu Mle Lla Spy Bsu Syn Eco EcZ Pae Vch Hin Nme Hpy Cje Mlo Ccr Rpr Ctr Cpn Tpa Bbu Uur Mge

AlaS 124 74 66 81 78 122 92 96 138 98 77 69 97 101 79 177 73 78 115 130 62 83 63 93 83 72 113 108 77 75 80 94 104 110 71

ArgS 97 ? 80 71 83 63 105 128 67 36

75 91 79 115 119 98 105 89 64 104 113 60 80 61 64 93 73 66 85 86 85 73 83 77 84 72

AspS 110 124 66 101 89 93 117 126 13527

114 99 70 59

97 104 83 70 73 104 195 191 77 100 101 92 86 84 151 137 82 95 92 76 81 74 73

AsnS - - - -94 89

92 82

107 81

- 169 40

- - ? - - 116 105 59

83 69 117 116 - 123 110 - - - - - - - - 87 78 89 84

CysS 74 91 - - 81 71 120 140 56 102 91 7491 71

92 78

43 40 55 90 89 92 68 63 63 74 87 66 69 81 89 83 89 90 107 71 89

GltX 72 72 77 93 83 82 115 120179 42

10585 80

78 60

101 85 107 75 60 89135 46

13643

59 59

86 46

72 74 64

88 82

10166

12070 69

66 94 82

92 76 97 109 70 77

GlnS - - - - - - - - 64 - - 83 - - - - - - 126 136 72 75 91 68 - - - 48 - - - - - - -

GlyQ - - - - - - - - - 104 83 - - - 78 92 60 95 118 120 72 124 114 81 89 85 95 82 99 76 90 - - - -

GlyS - - - - - - - - - 73 60 - - - 73 58 59 65 132 126 51 88 107 72 82 90 70 90 81 76 90 - - - -

GRS1 104 80 84 77 102 85 88 148 13529

- - 58 83 85 - - - - - - - - - - - - - - - - - 87 110 85 75

HisS 88 82 66 84 97 73 71 118 88 29

85 79

90 78

63 50

90 7553 44

50 49 40

87 82

74 658074

53 7984 66

89 7498 58

75 66

85 89 106 107 108 73 75

IleS 126 90 92 104 94 141 145 148 10425

114 103 62 78 75 97 65 78 60 117 123 60 85 ? 67 78 67 88 109 87 86 97 97 83 70 74

LeuS 124 98 108 73 100 127 141 139 82 29

11983

109 98 70 73 71 138 66 84 157 170 71 106 78 69 83 76 114 110 84 79 58 66 79 84 66

LysS 86 93 75 97 91 88 93 82 - - - - - - - - - - - - - - - - - - 90 76 79 - - 89 80 - -

LysU - - - - - - - -17442

89 77 8574 64

75 70

16132

109 109 93 18468

18366

78 136 105 75 86 65 - - - 85 83 - - 84 87

MetG 91 84 79 95 89 99 101 135 74 33

105 103 77 108 91 54 69 69 84 117 119 72 74 73 83 100 8777 69

105 99 100 87 82 84 83 80

PheS 96 105 98 87 85 77 83 128 11530

110 85 78 90 82 63 58 70 95 130 137 78 108 90 78 96 61 95 97 84 100 101 79 96 86 96

PheSA 120 - 59 92 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

PheT 76 110 74 67 82 76 91 10210930

86 89 65 81 72 65 62 52 89 116 117 56 103 77 59 78 71 92 116 80 69 86 86 100 72 108

ProS 67 103 72 78 107 85 78 121 13232

98 90 95 103 91 79 69 58 94 164 162 67 90 96 59 78 65 91 88 103 95 85 77 104 84 83

SerS 118 89 93 79 102 84 86 123 14239

98 92 98 109 87 127 124 70 92 164 176 78 117 98 69 86 82 87 65 97 89 77 101 112 110 74

ThrS 113 108 71 93 94 123 120 95 73

14240

115 102 89 116 87 133 107 11256

92 104 107 59 87 140 94 85 75 98 59 96 81 95 78 119 92 83

TrpS 10010578

65 90 102 93 77 8090 39

93 8994 63

88 87 90 54 54 92 91 117 57 64 69 70 85 74 118 87 96 88 83 76 101 83 79

TyrS 106 97 76 93 96 98 113 89 92 35

91 89 104 89 81 82 6411560

69 x

140 124 9756

72 40

97 81 86 69 116 89 79 91 88 90 95 77 77

ValS 115 96 84 87 101 120 124 127 127 117 106 83 95 74 78 89 65 82 181 181 80 118 102 71 93 90 115 103 80 89 78 92 118 69 72

TABLE VI

E(g) Values (Multiplied by 100) for Aminoacyl-tRNA Synthetases

PHX genes are shown in red, PA in blue. Special symbols: -, the gene is not present in the genome; ?, COG data indicate that the gene is in the genome but the name does not match any gene in the annotation; x, expression levels not evaluated (usually due to length < 80 aa); etc, more than three homologs belong to the COG. Top two E(g) values are shown. Genes: AlaS, alanyl-tRNA synthetase; ArgS, arginyl-tRNA synthetase; AspS, aspartyl-tRNA synthetase; AsnS, aspartyl/asparaginyl-tRNA synthetases; CysS, cysteinyl-tRNA synthetase; GltX, glutamyl-tRNA synthetase; GlnS, glutaminyl-tRNA synthetase; GlyQ, glycyl-tRNA synthetase alpha subunit; GlyS, glycyl-tRNA synthetase beta subunit; GRS1, glycyl-tRNA synthetase class II; HisS, histidyl-tRNA synthetase; IleS, isoleucyl-tRNA synthetase; LeuS, leucyl-tRNA synthetase; LysS, lysyl-tRNA synthetase class I; LysU, lysyl-tRNA synthetase class II; MetG, methionyl-tRNA synthetase; PheS, phenylalanyl-tRNA synthetase alpha subunit; PheSA, phenylalanyl-tRNA synthetase alpha subunit archaeal type; PheT, phenylalanyl-tRNA synthetase beta subunit; ProS, prolyl-tRNA synthetase; SerS, seryl-tRNA synthetase; ThrS, threonyl-tRNA synthetase; TrpS, tryptophanyl-tRNA synthetase; TyrS, tyrosyl-tRNA synthetase; ValS, valyl-tRNA synthetase.


Archaeal genomes lack AcnB (but some have AcnA) and lack SucA (Table VIII). The anaerobic STRPY lacks all classical TCA cycle genes except for Mdh. The spirochaetes (TREPA and BORBU) and mycoplasmas (e.g., UREUR and MYCGE) also lack TCA cycle genes except Mdh. However, Mdh is involved in several metabolic pathways including fermentation and CO2 fixation via the serine–isocitrate lyase pathway. On the other hand, DEIRA, ECOLI, MESLO, and CAUCR have most of the TCA genes at the PHX level.

The genes isocitrate lyase (AceA) and malate synthase (AceB) of the glyoxylate shunt are restricted in bacterial genomes and wholly absent from archaeal genomes. These genes are also missing from many small pathogenic bacteria, including RICPR, CHLTR, CHLPN, TREPA, BORBU, UREUR, and MYCGE. The glyoxylate shunt is required to synthesize precursors for carbohydrates when the carbon source is a C2 compound. Among archaea, the only occurrence of the glyoxylate cycle is found in Haloferax volcanii. In ECOLI, isocitrate lyase converts isocitrate into succinate and glyoxylate, allowing carbon that entered the TCA cycle to bypass the formation of 2-oxoglutarate and succinyl-CoA. Among the current complete genomes, isocitrate lyase is strongly PHX in DEIRA, MYCTU, and PSEAE, is present but not PHX in MYCLE, ECOLI, and VIBCH and in the α-proteobacteria MESLO and CAUCR, and is absent from all other currently sequenced prokaryotic genomes.

5. PROTEIN LENGTHS AND AMINO ACID USAGES AMONG THE THREE DOMAINS

Table IX reports the median (50th percentile level) variation of protein lengths, amino acid usages, and charge and hydrophobic usages for the five complete eukaryotes, 11 archaeal, and 38 currently available bacterial genome sequences. Two data sets were used for human sequences: the SWISS-PROT (SP) collection and the draft human genome public version (HGP) released in 2001. The numbers are remarkably congruent, indicating that SP provides a faithful representation of the complete genome. Several striking patterns are evident from Table VIII and we present some possible interpretations.

5.1. Protein Lengths across Proteomes

The median protein lengths (aa) of all complete genomes indicate the following orderings: archaea (median range 230–250 aa except THEAC 268, PYRAE 208, and AERPE 175) < bacteria (250–295 except NEIME 239 and STRPN 242) < eukaryotes (346–386). The same orderings hold when restricted to proteins of size ≥ 200 aa (archaea 331–340, HALSP 344, THEVO 347; bacteria 340–377, MESLO 337, LACLA 329, STAAU 338, STRPN 338, STRPY 335, UREUR 383; eukaryotes 433–473, CAEEL 401). The percent of proteins ≥ 200 aa relative to all proteins of the genome is 52–76% in archaea (AERPE 43%), 51–74% in bacteria, and 76–80% in eukaryotes. The greater length of eukaryotic proteins may reflect their intrinsic complexity, which may be attributed to their multifunctionality (several active regions of separate function), with concomitant multiple exons and alternative splicing. The smaller size of archaeal proteins may relate to more extreme environments and more specialized activity.
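The three summary statistics used in this subsection (median length of all proteins, median length of proteins of at least 200 aa, and the percentage of such proteins) are straightforward to compute from a list of protein lengths. The sketch below is illustrative only: the function name is an assumption, the 200 aa cutoff is read from the text, and the toy lengths are hypothetical rather than drawn from any genome.

```python
from statistics import median


def length_summary(lengths, cutoff=200):
    """Summarize protein lengths (aa) for one proteome.

    Returns the median over all proteins, the median over proteins with
    length >= cutoff (None if there are none), and the percentage of
    proteins with length >= cutoff.
    """
    long_proteins = [n for n in lengths if n >= cutoff]
    return {
        "median_all": median(lengths),
        "median_long": median(long_proteins) if long_proteins else None,
        "percent_long": 100 * len(long_proteins) / len(lengths),
    }


# Hypothetical protein lengths, for demonstration only.
toy_lengths = [120, 180, 210, 250, 400, 95, 330]
print(length_summary(toy_lengths))
```

Applied to each complete proteome in turn, the same three numbers would allow the archaea, bacteria, and eukaryote ranges quoted above to be compared directly.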

5.2. Charged Amino Acid Usages

For the three domains we compare median protein frequencies of acidic (E or D) and basic (K or R) residues. The usage of glutamate (E) pervasively exceeds the usage of aspartate (D) by at least 1% (mostly 2%), with reversals in only six prokaryotes. These reversals tend to be in high G+C genomes (HALSP 68%; XYLFA 53%; CAUCR 67%; MESLO 63%; MYCTU 66%; MYCLE 58%). However, other genomes of high G+C content (e.g., PSEAE 66%; SINME 63%; DEIRA 67%) do not show this property. Strong differences in charged amino acid usages are also observed in the halophilic archaeon HALSP, which shows unusually high levels of aspartate (D) usage (median 9.4% vs 4.3–6.2% in other organisms) and low usage of basic residues (median 8.1% compared to 9–13% in all other organisms). Acidic residues overall tend to be more used in euryarchaeal vs crenarchaeal proteomes and in the anciently diverged Gram-negative species AQUAE and THEMA (medians 14.2 and 14.3%). Usage values for acidic residues are in the range 11.4–12.2% for eukaryotes and 9.6–14.3% for archaea and bacteria.

What can account for E being used more than D? There is no real distinction on the basis of the underlying codon types (GAR vs GAY). Reasons could reflect constraints on D in terms of protein secondary structure and size. For example, poly-E can establish highly stable long α-helices, whereas poly-D does not form long α-helices or long β-strands (Richardson and Richardson, 1988). However, D is more common at active sites (e.g., in proteases) and in calcium metal coordination. The five complete

Gene | Archaea: Afu Hbs Mja Mth Tac Pho Pab Ape | Euk: Sce | Bacteria: Aae Tma Dra Mtu Mle Lla Spy Bsu Syn Eco EcZ Pae Vch Hin Nme Hpy Cje Mlo Ccr Rpr Ctr Cpn Tpa Bbu Uur Mge

GltA 101 116 - 84 76

103 98

- - 115 67 50 38

135 80 141 11997 92

114 37 -63 57 46

99 129

? 13972

79 70

96 49

- 97 66

100 8014965 etc

12455 48

101 - - - - - -

AcnA - 127 - - 111 - - 169 18359 53

124 - 256 118 86 37 - 108 - 29 ?

47 29

63 62

42 - 58 - - 161 155 79 - - - - - -

AcnB - - - - - - - - - - - - - - - - - 98 140 143 113 70 - 76 87 100 - - - - - - - - -

Icd 105 111 - - 99 - - 131 11947 34

113 115 207 92 - 45 - 133 95 165 162 71 - - - 111 - 146 152 - - - - - - -

LeuB 74 -85 68

105 101

- 8274 70

- 162 121etc

82 8715669

101 86 44 -50 43

76 71 ?

67 51

72 63 74 89 - 7513088

84 95 - - - - - -

Suc A - - - - - - - - 56 - - 170 128 104 - - 39 - 130 124 129 59 44 125 - - 140 134 76 87 79 - - - -

SucD 78 78

110 89 78 102 - - 133 79 119 113 - 165 108 97 - - 67 93

173 ? ?

18660 38

112 101 57 133 - 99136 118 170 100 98 85 - - - -

SucC 84 73

108 83 91 84 - - 121 45 116 123

- 171 114 103 - - 63 99 174 171 116 112 49 116 - 8511599

162 103 83 76 - - - -

SdhA 13084 78

100 67 69 98 - - 153 88 80 etc

92 93 120 135104 etc

103 61 - 57 74133 108

138 101

12352 44

137 117

127 135 82 11069

138 131 92 86 83 - - - -

SdhB 103 102 67 82 92 - - 136 82 95 78

- 121 12194 83

101 - - 63100 81

11355

12657

117 11350

109 137 93 75 74

112 100 88 92 89 - - - -

SdhC 97 91 - - 96 - - 132 77 48

- - 71 96 92 - - 64 - 81 83 112 63 - 101 87 95 99 75 94 91 85 - - - -

SdhD 97 - - - 103 - - 115 - - - 110 103 99 - - - - 88 86 112 89 - 108 - - 94 73 91 91 - - - - -

FumC - 101 - - - - - 89 84 - - 169 106 100 - - 54 75 32 4073 57

41 92 65 99 92 113 122 89 90 91 - - - -

TtdAa 87 - 7882 77

- 89 79 - - 96 95 - - - - - - -11060 56

11455 etc

77 47 - 72 - - 64 - - - - - - - -

FumA a 95 - 649680

- 85 84 - - 107 93 90

- - - - - - -11087 56

11487 etc

77 47 - 72 - - 64 - - - - - - - -

Mdh 95 102 66 91 112 - - 135 11755 45

86 75

102 180 105

122 12322645 44

172 81 76

92 142 132 - 66 118 - -88 75

130 168 92 78 90 - 119 - 84

TABLE VII

E(g) Values (Multiplied by 100) for TCA Cycle Genes

^a Some proteins occur in both TtdA and FumA COGs. PHX genes are shown in red, PA in blue. Special symbols: -, the gene is not present in the genome; ?, COG data indicate that the gene is in the genome but the name does not match any gene in the annotation; etc, more than three homologs belong to the COG. Top two E(g) values are shown. Genes: GltA, citrate synthase; AcnA, aconitase A; AcnB, aconitase B; Icd, isocitrate dehydrogenases; LeuB, isocitrate/isopropylmalate dehydrogenase; SucA, pyruvate and 2-oxoglutarate dehydrogenases E1 component; SucD, succinyl-CoA synthetase alpha subunit; SucC, succinyl-CoA synthetase beta subunit; SdhA, succinate dehydrogenase/fumarate reductase flavoprotein subunits; SdhB, succinate dehydrogenase/fumarate reductase Fe-S protein; SdhC, succinate dehydrogenase/fumarate reductase cytochrome b subunit; SdhD, succinate dehydrogenase hydrophobic anchor subunit; FumC, fumarase; TtdA, tartrate dehydratase alpha subunit/fumarate hydratase class I; FumA, tartrate dehydratase beta subunit/fumarate hydratase class I; Mdh, malate/lactate dehydrogenases.

eukaryotic genomes invariably have (E+D)% > (K+R)%. As with eukaryotes, most prokaryotic proteomes use more acidic residues than basic residues. But there are several examples with (K+R)% > (E+D)%, including BORBU, MYCGE, TREPA, BUCSP, RICPR, HELPY, UREUR, MYCPN, PYRAE, SULSO, and CLOAC.

5.3. Hydrophobic Residue Usages

The median usage of hydrophobic residues in eukaryotes is lower than in archaea and bacteria (36.4–38.4% vs 38.7–43.5%, except for STAAU, 38.2%). This observation is somewhat unexpected given that eukaryotic proteins tend to be longer than prokaryotic proteins and might contain proportionally larger hydrophobic cores. A possible explanation is that eukaryotic structures tend to be multi-domain rather than consisting of compact globular units. Among bacteria, the highest median hydrophobic usage is exhibited by α-proteobacteria (42.3–43.5%) and mycobacteria (43%).
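The class-level medians quoted in Sections 5.2 and 5.3 can be illustrated with a short sketch. It assumes the residue classes of the Lehninger alphabet given in the Table VIII footnote and the 200-residue length cutoff of that table; the function names and the toy sequences are hypothetical and are not the authors' code or data.

```python
from statistics import median

# Residue classes of the Lehninger alphabet, as listed in the Table VIII footnote.
CLASSES = {
    "acidic": set("ED"),
    "basic": set("KR"),
    "polar": set("HYPGSTNQ"),
    "hydrophobic": set("LIVAMFWC"),
}

def class_usage(protein):
    """Percent of the residues of one protein that fall in each class."""
    n = len(protein)
    return {name: 100.0 * sum(aa in members for aa in protein) / n
            for name, members in CLASSES.items()}

def median_class_usage(proteins, min_len=200):
    """Median of the per-protein class usages over proteins of >= min_len residues."""
    usages = [class_usage(p) for p in proteins if len(p) >= min_len]
    return {name: median(u[name] for u in usages) for name in CLASSES}

# Toy "proteome" (hypothetical sequences; cutoff lowered so that they qualify).
toy = ["EEDKLLLLVA", "EDKKGSTLIV", "EKRRNQAAAA", "MK"]
result = median_class_usage(toy, min_len=10)
print(result)
```

Note that a median of per-protein sums (for example E+D) need not equal the sum of the per-residue medians, which is why the class columns of a table built this way do not add up exactly from the single-residue columns.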

5.4. Correlations of G+C Genome Content and Amino Acid Usages

We observe a positive correlation of the genome frequency of G+C with the frequency of amino acids encoded from strongly binding bases {Ala, Gly, Pro, Trp, Arg} and a negative correlation with the frequency of amino acids encoded from weak bases {Lys, Ile, Phe, Tyr, Asn}.
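As an illustration of the sign of these correlations, the following sketch computes a Pearson coefficient between genome G+C content and the median usage of one strong-codon residue (Ala) and one weak-codon residue (Lys). The seven rows are an arbitrary subset copied from Table VIII; the subset and the helper name `pearson` are our assumptions for the example and do not reproduce the full analysis.

```python
from math import sqrt

# (species, genome %G+C, median Ala %, median Lys %), values copied from Table VIII.
ROWS = [
    ("BUCSP", 26, 4.3, 9.6),
    ("METJA", 31, 5.3, 10.4),
    ("HAEIN", 38, 8.1, 6.1),
    ("ECOLI", 51, 9.4, 4.1),
    ("MESLO", 63, 12.2, 3.3),
    ("CAUCR", 67, 13.8, 3.2),
    ("HALSP", 68, 12.5, 1.4),
]

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

gc = [row[1] for row in ROWS]
r_ala = pearson(gc, [row[2] for row in ROWS])  # strong-codon residue
r_lys = pearson(gc, [row[3] for row in ROWS])  # weak-codon residue
print(f"G+C vs Ala: r = {r_ala:.2f}; G+C vs Lys: r = {r_lys:.2f}")
```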

5.5. Amino Acid Usages in Thermophiles

Among thermophiles, optimal growth temperatures range from 50° to 100°C, but there is no correlation of amino acid usage with optimal growth temperature. To date, all sequenced archaeal genomes, with the exception of HALSP, are from thermophilic organisms. Two hyperthermophilic bacterial genomes, AQUAE and THEMA, are available. What is the nature of amino acid usages of thermophiles (or hyperthermophiles) compared to mesophiles? The following features stand out: Thermophilic organisms persistently show a higher charge usage than mesophilic organisms of similar G+C content. The only exception is HALSP, whose proteins tend to be richer in Asp than other proteomes. The greater charge in proteins of thermophiles presumably implies more salt-bridge connections in their 3D structure and concomitantly greater structural stability.

The frequencies of the strong amino acids Ala and Gly are significantly correlated with G+C content. Discounting this correlation, thermophilic organisms generally show lower frequencies of Ala and higher frequencies of Gly than mesophilic organisms. This is particularly obvious for the frequency of Ala in THEMA (5.7%) and AQUAE (5.6%), compared to about 7.0–8.0% in mesophiles of similar G+C content.
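One way to read "discounting this correlation" is to compare each thermophile with the Ala usage expected from its G+C content. The sketch below fits an ordinary least-squares line to a set of mesophilic bacteria and reports the residuals of THEMA and AQUAE. The choice of a linear fit, the selection of mesophiles, and the function names are assumptions made for illustration; all values are copied from Table VIII, and the procedure is not claimed to be the one used in the analysis above.

```python
# (species, genome %G+C, median Ala %), values copied from Table VIII.
MESOPHILES = [
    ("BUCSP", 26, 4.3), ("RICPR", 29, 6.0), ("HAEIN", 38, 8.1),
    ("PASMU", 40, 8.5), ("BACSU", 44, 7.6), ("BACHA", 44, 7.3),
    ("VIBCH", 47, 9.2), ("SYNY3", 48, 8.5), ("ECOLI", 51, 9.4),
    ("MESLO", 63, 12.2), ("CAUCR", 67, 13.8),
]
THERMOPHILES = [("AQUAE", 43, 5.6), ("THEMA", 46, 5.7)]

def fit_line(points):
    """Ordinary least-squares slope and intercept for (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, my - slope * mx

slope, intercept = fit_line([(gc, ala) for _, gc, ala in MESOPHILES])

# Residual = observed Ala usage minus the usage expected from G+C alone.
residuals = {name: ala - (slope * gc + intercept)
             for name, gc, ala in THERMOPHILES}
for name, res in residuals.items():
    print(f"{name}: Ala residual {res:+.1f} percentage points")
```

A negative residual indicates less Ala than expected for that G+C content, which is the direction described in the text for the two hyperthermophilic bacteria.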

β-branched residues are suggested as favorable in stabilizing thermophilic proteins (Gromiha et al., 1999). Frequencies of Val and especially Ile show some correlation (positive and negative, respectively) with genome G+C content. In particular, Ile and Val frequencies tend to be increased in most thermophilic archaeal proteomes and, in the case of Val, also in THEMA and AQUAE.

6. CONCLUDING REMARKS AND PROSPECTS

Among all prokaryotic genomes (> 54 currently available, with more than one million codons in aggregate) there are only 76 gene sequences common to all genomes, of which more than half are ribosomal protein sequences and more than a dozen are aminoacyl-tRNA synthetases. Also, several are major protein processing factors, and a few are chaperone complexes. The genomic and proteomic evaluations of the text and of Tables II–IX show substantial differences among prokaryotic genomes. Woese et al. (1990), based on 16S rRNA comparisons for a wide variety of organisms, have argued for partitioning all independent organisms into three sets: bacteria, archaea, and eukaryotes. Woese notes that bacterial cell membranes generally are formed of glycerol diesters of long hydrocarbon chains, while archaeal membranes are formed of isoprenoid glycerol diethers. Eukaryotes generally possess membranes formed of glycolipids. We have examined more than 20 properties, discussed in the foregoing text, of the three proposed "domains" and have found many discrepancies for different properties. Poole et al. (1999) argue for coherence of each domain but "with many exceptions." Significant differences between eukaryotes, archaea, and bacteria are found in the distribution of protein lengths and for various classes of proteins including chaperonins, informational and metabolic proteins. Table IX summarizes presence, absence, and expression levels for a broad spectrum of genomic and proteomic attributes. Differences among

TABLE VIII

Median Protein Length and Amino Acid Usages in Eukaryotic, Archaeal, and Bacterial Proteomes

Columns: Species^a; % G+C; No. of sequences (All, ≥200 aa, % ≥200); Median length (All, ≥200); Median amino acid usages in proteins ≥200 aa (E through C, then the classes −, +, p, f^b).

Species^a %G+C All ≥200aa %≥200 All ≥200 E D K R H L I V M F Y W P G A S T N Q C − + p f^b

HUM-SP^c 42 6194 4824 78 372 447 6.5 4.9 5.4 5.3 2.4 9.5 4.6 6.2 2.2 3.8 3.0 1.2 5.3 6.7 6.9 7.4 5.2 3.6 4.2 1.9 11.5 10.9 38.9 37.6

HUMAN 42 13245 10651 80 386 456 6.6 4.8 5.5 5.4 2.4 9.8 4.4 6.1 2.2 3.7 2.8 1.2 5.5 6.4 6.8 7.6 5.1 3.6 4.3 1.9 11.6 11.1 38.8 37.4

DROME 42 14100 10740 76 375 470 6.1 5.2 5.3 5.5 2.5 9.1 4.9 5.9 2.3 3.6 3.0 0.9 4.9 5.9 7.2 7.6 5.3 4.5 4.5 1.7 11.4 10.9 39.3 36.9

CAEEL 36 17083 13668 80 346 401 6.0 5.1 6.1 4.9 2.2 8.9 6.2 6.2 2.6 4.9 3.2 1.0 4.4 4.8 5.9 7.7 5.6 4.8 3.7 1.8 11.4 11.4 37.4 38.4

YEAST 38 6066 4782 79 384 473 6.4 5.8 7.2 4.3 2.1 9.4 6.5 5.5 2.0 4.3 3.3 0.9 4.2 4.9 5.2 8.3 5.6 5.8 3.7 1.2 12.3 11.7 38.6 36.4

ARATH 36 25338 19784 78 361 433 6.5 5.4 6.2 5.3 2.2 9.3 5.3 6.7 2.3 4.3 2.8 1.2 4.6 6.1 6.1 8.6 5.1 4.3 3.2 1.7 12.0 11.7 37.7 37.9

PYRAB 45 1763 1192 68 265 339 8.9 4.7 7.7 5.6 1.4 9.9 8.4 8.0 2.3 4.2 3.8 1.0 4.2 7.3 6.6 4.8 4.1 3.2 1.5 0.3 14.1 13.5 31.1 41.0

PYRHO 42 2058 1181 57 232 334 8.8 4.6 8.0 5.4 1.4 9.9 8.7 7.8 2.3 4.3 3.8 1.0 4.2 7.1 6.3 5.0 4.2 3.3 1.5 0.3 13.7 13.5 31.5 40.8

ARCFU 49 2409 1438 60 241 334 9.0 5.0 6.8 5.6 1.4 9.2 7.1 8.5 2.5 4.4 3.6 0.9 3.8 7.4 7.8 5.4 4.1 3.0 1.6 0.9 14.3 12.6 31.0 41.5

THEAC 46 1478 1001 68 268 340 6.3 6.2 5.6 5.6 1.6 8.1 8.9 7.1 3.2 4.4 4.4 0.7 3.9 7.3 7.0 7.4 4.6 3.9 1.9 0.4 12.9 11.8 35.7 39.2

THEVO 40 1499 974 65 259 347 6.7 6.0 7.1 4.7 1.4 8.3 9.0 7.0 2.6 4.4 4.6 0.7 3.7 7.1 6.4 7.4 4.7 4.4 1.9 0.4 13.0 12.1 35.8 38.8

METTH 50 1869 1123 60 242 336 8.4 6.1 4.0 6.9 1.8 9.1 7.6 7.7 3.0 3.5 2.9 0.7 4.3 8.1 7.4 6.0 4.8 2.9 1.7 0.9 14.8 11.1 33.2 40.3

METJA 31 1773 1076 61 241 336 8.7 5.7 10.4 3.7 1.3 9.2 10.4 6.7 2.1 4.0 4.2 0.5 3.3 6.3 5.3 4.3 4.0 5.1 1.3 1.0 14.6 14.2 30.6 39.9

HALSP 68 2058 1554 76 250 344 6.7 9.4 1.4 6.4 2.2 8.2 3.6 9.1 1.5 3.0 2.5 0.9 4.6 8.4 12.5 5.0 6.6 2.1 2.5 0.6 16.5 8.1 34.6 40.0

AERPE 56 2694 1164 43 175 331 7.4 4.3 3.5 7.6 1.5 11.0 5.3 9.1 2.1 2.7 3.4 1.2 5.3 8.9 9.6 6.3 4.0 1.8 1.6 0.5 11.8 11.7 33.8 42.0

PYRAE 51 2603 1351 52 208 327 7.1 4.4 5.4 6.4 1.4 10.2 6.2 9.4 1.8 3.5 4.2 1.2 5.0 7.8 9.8 4.8 4.3 2.5 1.9 0.6 11.8 12.3 32.4 43.1

SULSO 36 2982 1842 62 251 332 7.2 5.0 8.0 4.6 1.2 10.2 9.3 7.4 2.0 4.2 4.7 0.9 3.7 6.4 5.5 6.4 4.5 4.7 1.9 0.4 12.5 12.9 34.1 40.1

ECOLI 51 4286 2922 68 278 351 5.9 5.2 4.1 5.5 2.2 10.6 5.9 7.0 2.7 3.7 2.7 1.4 4.4 7.3 9.4 5.6 5.2 3.7 4.2 1.1 11.3 9.9 36.1 41.7

HAEIN 38 1707 1118 65 262 342 6.5 5.0 6.1 4.2 2.0 10.5 7.0 6.5 2.4 4.2 3.0 1.0 3.7 6.7 8.1 5.7 5.1 4.7 4.4 1.0 11.8 10.7 36.0 40.7

PSEAE 67 5564 4000 72 291 348 6.2 5.4 2.4 7.7 2.1 12.3 4.0 6.9 1.9 3.5 2.4 1.4 5.0 8.3 11.6 5.3 4.0 2.4 4.0 0.9 11.8 10.6 34.2 42.7

BUCSP 26 564 395 70 282 362 5.2 4.3 9.6 3.6 2.1 9.8 11.4 4.7 2.0 4.8 3.7 0.8 3.0 5.3 4.3 7.3 4.6 7.1 3.1 1.1 9.6 13.7 36.7 39.4

VIBCH 47 3835 2407 63 259 363 6.2 5.1 4.6 4.9 2.3 10.7 6.0 6.9 2.6 3.8 2.9 1.2 4.0 6.6 9.2 6.1 5.1 3.7 4.9 0.9 11.5 9.7 36.4 41.3

PASMU 40 2014 1442 72 286 347 6.0 4.9 5.6 4.3 2.3 10.9 6.7 6.7 2.4 4.2 3.1 1.0 3.9 6.5 8.5 5.5 5.2 4.1 4.8 1.0 11.2 10.2 36.3 41.5

XYLFA 53 2766 1400 51 202 359 5.0 5.4 3.0 6.8 2.6 10.8 5.2 7.4 2.1 3.3 2.5 1.3 5.0 7.5 10.3 5.6 5.5 3.0 4.1 1.0 10.6 10.1 36.2 41.9

NEIME 52 2025 1189 59 239 351 6.2 5.4 5.2 5.4 2.1 9.9 5.8 6.9 2.4 3.8 2.8 1.0 4.2 7.8 10.1 5.4 5.1 3.7 3.8 0.9 11.8 10.8 35.4 41.1

AGRTU 59 2721 1846 68 271 342 5.9 5.7 3.8 6.4 1.9 9.6 5.7 7.3 2.7 3.9 2.2 1.1 4.8 8.3 11.4 5.7 5.2 2.8 2.9 0.7 11.9 10.6 34.4 42.3

AGRTU-L^d 59 1833 1456 79 315 343 5.8 5.6 3.4 6.5 2.0 9.8 5.7 7.3 2.6 3.9 2.2 1.2 4.8 8.2 11.4 5.8 5.3 2.9 3.0 0.7 11.7 10.4 34.6 42.6

CAUCR 67 3737 2539 68 275 353 5.4 5.9 3.2 7.1 1.7 9.9 4.3 7.5 2.2 3.4 2.0 1.3 5.4 8.8 13.8 4.9 5.0 2.1 3.0 0.7 11.5 10.6 33.6 43.5

MESLO 63 7281 4826 66 268 337 5.4 5.8 3.3 6.9 2.0 9.8 5.3 7.4 2.4 3.7 2.1 1.2 5.0 8.5 12.2 5.5 5.1 2.6 2.9 0.7 11.5 10.6 34.3 43.0

SINME 63 3341 2302 69 276 340 6.2 5.7 3.3 7.0 1.9 9.7 5.4 7.4 2.4 3.8 2.1 1.1 4.9 8.5 12.0 5.5 5.1 2.6 2.7 0.7 12.1 10.6 33.8 42.7

SINMEpA^e 60 1294 836 65 265 330 5.9 5.3 2.9 7.2 2.1 10.1 5.4 7.5 2.5 3.8 2.1 1.2 5.0 8.2 11.8 5.7 5.2 2.6 2.9 0.8 11.6 10.4 34.4 43.0

RICPR 29 834 582 70 282 358 5.8 4.9 8.3 3.1 1.8 10.1 10.8 5.5 2.0 4.6 4.0 0.6 3.1 5.2 6.0 6.6 5.0 6.4 3.0 1.0 10.9 11.9 36.0 40.5

HELPY 39 1575 1042 66 266 358 6.9 4.7 8.9 3.4 2.1 11.2 7.1 5.4 2.1 5.2 3.6 0.5 3.2 5.4 6.9 6.7 4.1 5.5 3.5 1.0 11.9 12.4 34.7 40.0

CAMJE 31 1635 1122 69 268 348 7.1 5.3 9.5 2.8 1.6 10.8 8.5 5.1 2.1 5.8 3.5 0.5 2.6 5.4 6.7 6.3 3.9 6.0 3.0 1.2 12.7 12.6 33.2 40.7

CHLTR 41 893 625 70 289 377 6.5 4.6 5.4 4.7 2.2 11.3 6.6 6.3 1.9 4.8 3.0 0.8 4.4 6.2 7.4 7.8 4.9 3.2 4.0 1.6 11.2 10.3 36.5 41.0

CHLPN 41 1052 731 69 289 372 6.5 4.6 5.8 4.4 2.3 11.4 6.9 6.1 1.8 4.7 3.2 0.9 4.4 6.0 6.9 7.7 5.1 3.5 3.9 1.6 11.2 10.4 37.0 40.3

CHLMU 40 818 558 68 287 375 6.4 4.5 5.6 4.7 2.2 11.1 6.6 6.4 2.0 4.8 3.0 0.8 4.3 6.2 7.1 8.1 4.8 3.3 4.1 1.6 11.1 10.5 36.6 40.8

BORBU 29 850 599 70 285 358 6.7 5.2 9.9 3.0 1.2 10.4 10.4 5.3 1.7 6.0 4.1 0.4 2.5 5.1 4.3 7.3 3.7 7.0 2.1 0.5 12.1 13.2 34.2 39.9

TREPA 53 1030 743 72 293 367 5.8 4.5 3.6 7.2 2.8 10.0 4.7 8.2 2.0 4.3 3.0 0.9 4.1 6.9 10.2 6.4 5.2 2.3 3.7 1.8 10.5 11.2 34.9 42.6

DEIRA 67 3117 2073 67 263 344 5.8 5.2 2.3 7.3 2.0 11.4 3.0 7.7 1.7 3.0 2.2 1.2 5.8 9.1 12.3 4.9 5.5 2.2 3.9 0.5 11.0 10.0 36.5 41.5


THEMA 46 1846 1308 71 284 344 9.0 5.0 7.5 5.5 1.5 9.8 7.1 8.7 2.2 4.9 3.4 0.9 4.0 7.0 5.7 5.4 4.4 3.5 1.8 0.5 14.3 13.1 31.9 40.2

AQUAE 43 1521 1072 70 274 340 9.6 4.3 9.3 4.8 1.5 10.4 7.2 7.9 1.7 4.8 4.0 0.8 4.0 6.8 5.6 4.7 4.2 3.4 1.9 0.6 14.2 14.3 31.2 39.4

MYCTU 66 3909 2766 71 286 360 4.7 6.0 1.8 7.5 2.2 9.8 4.2 8.6 1.8 2.8 2.0 1.3 5.7 8.7 13.1 5.3 5.8 2.1 3.0 0.8 11.1 9.6 35.5 43.0

MYCLE 58 1605 1100 69 282 358 5.0 6.0 2.4 6.9 2.2 10.0 4.7 9.2 2.0 2.8 2.1 1.2 5.2 8.5 11.8 5.8 6.0 2.5 3.1 0.9 11.2 9.5 35.7 43.0

BACSU 44 4097 2572 63 256 340 7.4 5.3 6.8 4.0 2.2 9.5 7.2 6.8 2.6 4.2 3.3 0.9 3.7 7.0 7.6 6.2 5.3 3.7 3.6 0.7 13.0 11.0 35.7 39.4

BACHA 44 4066 2620 64 261 341 7.9 5.1 5.7 4.7 2.3 9.8 6.8 7.4 2.6 4.1 3.3 1.0 3.8 7.0 7.3 5.6 5.5 3.5 3.9 0.6 13.2 10.5 35.5 39.5

LACLA 35 2266 1409 62 251 329 7.3 5.4 7.2 3.4 1.7 9.7 7.5 6.6 2.4 4.5 3.4 0.8 3.3 6.5 7.3 6.3 5.5 5.0 3.5 0.3 13.0 10.9 35.7 39.3

STAAU 33 2714 1715 63 254 338 6.5 6.0 7.2 3.4 2.2 8.9 8.4 6.7 2.6 4.3 3.8 0.6 3.2 6.2 6.1 5.9 5.6 5.3 3.9 0.5 12.9 10.8 36.7 38.2

STRPN 40 2094 1255 60 242 338 7.3 5.7 6.4 3.9 1.8 10.1 7.2 7.0 2.4 4.5 3.6 0.7 3.3 6.6 7.2 6.0 5.3 4.1 3.9 0.4 13.4 10.5 35.5 39.8

STRPY 39 1696 1127 66 263 335 6.3 6.0 6.6 3.8 2.0 9.9 7.4 6.7 2.5 4.2 3.5 0.7 3.4 6.4 7.6 6.0 5.8 4.2 4.1 0.5 12.6 10.7 36.2 39.7

CLOAC 31 3672 2418 66 262 342 7.1 5.7 9.5 3.3 1.2 8.5 9.6 6.7 2.5 4.3 4.1 0.5 2.7 6.2 5.4 6.5 4.7 6.3 2.1 1.1 12.9 13.0 34.3 38.7

UREUR 25 611 418 68 286 383 5.7 5.7 9.8 2.6 1.6 10.0 10.5 5.3 1.7 5.1 4.3 0.8 2.5 4.1 4.8 5.9 4.8 8.5 3.7 0.6 11.6 12.7 36.3 39.0

MYCGE 32 466 344 74 292 369 5.4 4.9 9.6 2.8 1.5 10.7 8.4 6.1 1.5 6.1 3.2 0.8 2.9 4.5 5.4 6.3 5.2 7.4 4.4 0.8 10.5 12.7 36.0 40.4

MYCPN 40 677 480 71 286 354 5.3 5.0 8.4 3.3 1.7 10.4 6.7 6.4 1.5 5.5 3.2 1.0 3.3 5.3 6.6 5.9 5.7 6.1 5.1 0.7 10.5 11.8 36.9 39.7

SYNY3 48 3163 2086 66 274 360 5.9 5.0 3.9 5.0 1.8 11.4 6.1 6.7 1.9 3.8 2.9 1.4 5.1 7.3 8.5 5.6 5.3 3.7 5.5 0.9 11.0 9.1 38.0 40.9

^a Species names generally follow the SwissProt convention: first three letters of the genus name followed by the first two letters of the species name.

^b Lehninger alphabet: − is E, D; + is K, R; p (polar uncharged) is H, Y, P, G, S, T, N, Q; f (hydrophobics) is L, I, V, A, M, F, W, C.

^c SwissProt collection of human proteins.

^d Linear chromosome of Agrobacterium tumefaciens.

^e Plasmid pSymA of Sinorhizobium meliloti.


genomes are sensitive to lifestyle, habitat, food sources, physiology, type of membranes, and many other factors. For example, among fermentation processes, there are many alternative end-products (e.g., acetate, lactate, ethanol) depending on environmental conditions and degree of resistance to an acidic milieu.

Many theories have been proposed on domains of life, the origin and early evolution of eukaryotes, and the genesis of organelles. 16S rRNA genes give results conflicting with many protein sequence comparisons. It is increasingly appreciated that the genomes of many prokaryotes and primitive eukaryotes are "heterogeneous unions" in which lateral transfer and/or close associations have been at work (Doolittle, 1998; Ochman et al., 2000; Campbell, 2000). The current number of sequenced archaeal genomes is 13, and numerous others have yet to be sequenced. Also, Woese (1998) no longer prescribes a single progenote as the genesis of life but rather a "community" of initial life forms involving much lateral exchange among them. Along these lines, archaeal–bacterial partnerships preceding the origin of eukaryotes have been proposed (Zillig et al., 1989; Gupta and Golding, 1996; Martin and Müller, 1998; Lopez-Garcia and Moreira, 1999; Karlin et al., 1999). Hartman and Fedorov (2002) postulate that the eukaryotic domain was established as a union of three genome types: a bacterium, an archaeon, and a "chronocyte," the latter contributing to several basic eukaryotic systems including spliceosomal mechanisms, capping enzymes, nuclear pore constituents, and endoskeletal apparatus allowing for the functionalities of endocytosis, signaling and control. A primitive chronocyte is apparently no longer extant. Specificities and antecedents of these three cell types are also unclear.

Conventional methods of phylogenetic reconstruction from sequence information use only similarity or dissimilarity assessments of aligned homologous genes or regions. For a detailed review of problems of inferring phylogeny, see Brocchieri (2001) and Gribaldo and Philippe (2002). Difficulties intrinsic in phylogenetic methods include the following: (i) Alignments of distantly related long sequences (e.g., complete genomes) are generally not feasible. (ii) Different phylogenetic reconstructions may result for the same set of organisms based on analysis of different protein, gene, or noncoding sequences. Attempts are made to overcome this by averaging over many proteins or by concatenating sequences (Daubin and Gouy, 2001), mostly restricted to classes of ribosomal protein genes. Even the numbers of RPs among prokaryotic genomes differ. (iii) Resultant trees may be highly dependent on details of the

TABLE IX

Various Summary Comparisons of Genomic and Proteomic Properties

Property | Bacteria | Archaea | Eukaryotes

Shine–Dalgarno sequences contributing to translation initiation | + | +; many Crenarchaea have mostly leaderless transcripts | −; cap structure

GC skew | + for bacterial chromosomes with a single origin (exceptions: DEIRA, SYNSQ, THEMA, AQUAE) | − | −

Periodic 30 bp repeats | − (exceptions: BACHA, THEMA, AQUAE) | + in all thermophiles; − in HALSP | −

Aggregate 4 (and 6) bp palindromes under-represented | + | + | −

Genome of single chromosomes | + (exceptions: VIBCH, DEIRA, SINME, RHOSP) | + | −

Linear chromosomes | BORBU, AGRTU, RHOSP; Streptomyces; most genomes are circular | − | +

Extremes of genome signature; dinucleotide relative abundances | Variable | Variable | Variable; TA and CG mostly low in vertebrates

Nitrogen fixation | (+ and −) | (+ and −) | −

Lps (lipopolysaccharide) family | (+ and −) | (+ and −) | −

Lipid A | Gram-negative +; Gram-positive − | − | −

Chaperones/degradation | + | + | +

HSP60 | GroEL (not in UREUR) | Thermosome | Tcp1

HSP70 (DnaK) | + | Mostly missing; present in THEAC, METTH, HALSP | Generally +

Prefoldin complex (GimC) | − | + | +

Prefoldin hexamer structure | | 2α; 4β | 2α and 4β but different than archaea

Trigger factor | + | − | −

HSP90 | Variable | − | +

Thioredoxin | + | + | +

Peptidyl-prolyl cis–trans isomerases, 3 families: Cyclophilins | + | HALSP, METTH | +

Peptidyl-prolyl cis–trans isomerases: Fkbp | + | + | +

Peptidyl-prolyl cis–trans isomerases: Parvulin | + | − | +

Lon | Variable | − | −

FtsZ (division control protein) | + | Variable; not found in Crenarchaea | −; very weak similarity between archaeal FtsZ and tubulin

Ribosomal protein S1 (rpsA) | + (generally > 500 aa length) | − | −

Cluster of RP genes | + | +, − mixed | −

S2 | Separated | Included in short cluster | Not present

Ribosomal proteins P0; P1; P2 (acidic, regulatory) | − | Only P0 | +

No. of RPs | 50–63 | 53–65 | 78–79 in higher eukaryotes; 72 in GIALA

Existence of introns | − | − euryarchaea; + Crenarchaea | +

Phosphotransferase (PTS) system | Variable | − | −

Hexokinase and/or glucokinase | Variable (most bacterial organisms use the PTS system) | − | +

Translation elongation factors Tuf, Fus | Multiple copies in proteobacteria α, β, γ | One copy | Many

No. of origins of replication | Generally one (possible exceptions: DEIRA, SYNSQ, ARCFU, THEMA) | Unknown; no GC skew | Multiple

PCNA replication factor | − | + | +

Reverse gyrase | AQUAE, THEMA; not found in mesophilic or moderate thermophilic prokaryotes | + hyperthermophiles | −

Protein median lengths among proteins ≥80 aa | Range 260–295 aa, except NEIME 239 | 230–250 aa; THEAC 268, PYRAE 208, AERPE 175 | 346–384 aa

% proteins ≥200 aa in the genome | Range 51–74% | 52–76% | 76–80%

alignment algorithm used, inadequacies of phylogenetic methods, and biases in species and sequence sampling. (iv) Long branch attraction, mutational saturation, and different selection processes make ancient branching unreliable. (v) Chimeric origins, recombination, inversions, transpositions, and lateral transfer between distantly related organisms can complicate analyses. (vi) Tree construction derived from aligned sequences cannot apply to organisms for which similar gene sequences are largely unavailable (e.g., for bacteriophages, eukaryotic viruses, or deeply divergent organisms). (vii) Unrecognized paralogy and widespread reductions and expansions of genome content pose further problems. Translation of sequence similarities into evolutionary relatedness will always be questionable as the underlying assumptions about mutation rates, selection forces and gene transfer events are uncertain. All the models of sequence evolution used are undoubtedly far from the real evolutionary mechanisms.

The three-domain hypothesis and the endosymbiont hypothesis have undergone many changes. First, the original reason for dividing the living world into three domains was that there were, on the initial evidence, approximately three sets and that these were about equally "deviant" from one another. However, Table IX shows great variability within and among the three "domains." Insistence on three domains leaves frozen a classification based on the limited knowledge available in the past (Karlin et al., 1997). If A (Archaea) and K (eukaryotes) are more closely related than either is to B (Bacteria) (another point of controversy), then A and K are in the same domain. Otherwise, why not define additional primary domains within these three domains? Recent controversy has arisen regarding the problem of locating the bacterial root. In this context there are now a number of proposals placing the root on the eukaryotic branch (Poole et al., 1999; Forterre and Philippe, 1999), with later reductions producing the prokaryotes.

With the range of protein sequences now available, the nuclear genome appears to be chimeric. If it arose by fusion of two (or more) genomes, each genome must have had its own 16S rRNA, one of which might then have broken off to inhabit the organelle. This leaves many possible scenarios. Our favorite compresses a multi-organismal fusion and the endosymbiont invasion into a single event (Karlin et al., 1999). The chimeric nature of the nuclear genome could then result primarily from migration into the nucleus of many genes, not just those affecting organellar function. We consider the Sulfolobus line as a likely candidate for the endosymbiont, particularly of animal mitochondria, for reasons outlined previously (Karlin et al., 1999). These reasons include similarities in genome signature.

The patterns of dinucleotide relative abundance values (genomic signatures) are about the same for every contig of at least 50 kb length from the same organism but significantly different for those from different organisms. The uniformity of the signature throughout each genome suggests a recent acquisition (on an evolutionary time scale). Mechanisms for the evolution and maintenance of the genomic signature are unknown, although there are data to suggest that genome-wide processes, including DNA replication and repair, contribute intrinsically to the genomic signature (Karlin and Burge, 1995; Blaisdell et al., 1996; Karlin, 1998). The genomic signature is also useful for detecting pathogenicity islands (Karlin, 2001) in bacterial genomes, and it strongly influences codon usage.
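For readers who wish to experiment with genomic signatures, the following sketch computes symmetrized dinucleotide relative abundances and an average absolute difference between two sequences. It assumes the usual formulation associated with Karlin and Burge (1995): frequencies are taken from the sequence together with its inverted complement, rho*_XY = f*_XY / (f*_X f*_Y), and the distance is the mean of |rho*_XY(a) − rho*_XY(b)| over the 16 dinucleotides. The function names and the short test sequence are hypothetical; real comparisons would use contigs of about 50 kb, as stated above.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")
DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def signature(seq):
    """Symmetrized dinucleotide relative abundances rho*_XY for one sequence."""
    seq = seq.upper()
    mono, di = Counter(), Counter()
    # Count on both strands so that the profile is strand-symmetric.
    for strand in (seq, reverse_complement(seq)):
        mono.update(strand)
        di.update(strand[i:i + 2] for i in range(len(strand) - 1))
    n_mono = sum(mono[b] for b in "ACGT")      # ignores ambiguous symbols
    n_di = sum(di[d] for d in DINUCS)
    rho = {}
    for d in DINUCS:
        fx, fy = mono[d[0]] / n_mono, mono[d[1]] / n_mono
        # Undefined (NaN) if a base is absent from the sequence.
        rho[d] = (di[d] / n_di) / (fx * fy) if fx and fy else float("nan")
    return rho

def signature_distance(seq_a, seq_b):
    """Mean absolute difference of rho*_XY over the 16 dinucleotides."""
    ra, rb = signature(seq_a), signature(seq_b)
    return sum(abs(ra[d] - rb[d]) for d in DINUCS) / 16

example = "ATGCGCGATATCGGCTAAGCTTACG"
rho = signature(example)
print({d: round(v, 2) for d, v in rho.items()})
```

Because both strands are counted, complementary dinucleotides such as AC and GT receive the same value, so the 16 entries contain only 10 distinct quantities.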

ACKNOWLEDGMENTS

Supported in part by NIH Grants 5R01GM10452-40 and

5R01HG00335-15.

REFERENCES

Antelmann, H., Bernhardt, J., Schmid, R., Mach, H., Völker, U., and

Hecker, M. 1997. First steps from two-dimensional protein index

towards a response regulation map for Bacillus subtilis,

Electrophoresis 18, 1451–1463.

Benachenhou-Lahfa, N., Labedan, B., and Forterre, P. 1994. PCR-

mediated cloning and sequencing of the gene encoding glutamate

dehydrogenase from the archaeon Sulfolobus shibatae: Identification

of putative amino-acid signatures for extremophilic adaptation,

Gene 140, 17–24.

Beutler, E., Gelbart, T., Han, J. H., Koziol, J. A., and Beutler, B. 1989.

Evolution of the genome and the genetic code: Selection at the

dinucleotide level by methylation and polyribonucleotide cleavage,

Proc. Natl. Acad. Sci. USA 86, 192–196.

Blaisdell, B. E., Campbell, A. M., and Karlin, S. 1996. Similarities and

dissimilarities of phage genomes, Proc. Natl. Acad. Sci. USA 93,

5854–5859.

Blumenthal, A. B., Kriegstein, H. J., and Hogness, D. S. 1974. The

units of DNA replication in Drosophila melanogaster chromosomes,

Cold Spring Harbor Symp. Quant. Biol. 38, 205–223.

Brocchieri, L. 2001. Phylogenetic inferences from molecular sequences:

Review and critique, Theor. Popul. Biol. 59, 27–40.

Brown, J. R., and Doolittle, W. F. 1997. Archaea and the prokaryote-

to-eukaryote transition, Microbiol. Mol. Biol. Rev. 61, 456–502.

Brown, J. R., Masuchi, Y., Robb, F. T. and Doolittle, W. F. 1994.

Evolutionary relationships of bacterial and archaeal glutamine

synthetase genes, J. Mol. Evol. 38, 566–576.

Calladine, C. R., and Drew, H. R. 1992. ‘‘Understanding DNA,’’

Academic Press, San Diego.

Campbell, A. M. 2000. Lateral gene transfer in prokaryotes. Theor.

Popul. Biol. 57, 71–77.

Cardon, L. R., Burge, C., Clayton, D. A., and Karlin, S. 1994.

Pervasive CpG suppression in animal mitochondrial genomes,

Proc. Natl. Acad. Sci. USA 91, 3799–3803.

Creti, R., Ceccarelli, E., Bocchetta, M., Sanangelantoni, A. M.,

Tiboni, O., Palm, P., and Cammarano, P. 1994. Evolution of

translational elongation factor (EF) sequences: Reliability of global

phylogenies inferred from EF-1 alpha(Tu) and EF-2(G) proteins,

Proc. Natl. Acad. Sci. USA 91, 3255–3259.

Daubin, V., and Gouy, M. 2001. Bacterial molecular phylogeny using

supertree approach, Genome Inform. Ser. Workshop Genome

Inform. 12, 155–164.

Deuerling, E., Schulze-Specking, A., Tomoyasu, T., Mogk, A., and

Bukau, B. 1999. Trigger factor and DnaK cooperate in folding of

newly synthesized proteins, Nature 400, 693–696.

Doolittle, W. F. 1998. You are what you eat: A gene transfer ratchet

could account for bacterial genes in eukaryotic nuclear genomes,

Trends Genet. 14, 307–311.

Draper, D. E. 1996. Translational initiation, in ‘‘Escherichia coli

and Salmonella: Cellular and Molecular Biology,’’ 2nd ed.

(F. C. Neidhardt, R. Curtiss III, J. L. Ingraham, E. C. C. Lin,

K. B. Low, B. Magasanik, W. S. Reznikoff, M. Riley, M.

Schaechter and H. E. Umbarger, Eds.), pp. 902–908, ASM Press,

Washington, DC.

Echols, H. and Goodman, M. F. 1991. Fidelity mechanisms in DNA

replication, Annu. Rev. Biochem. 60, 477–511.

Forterre, P., and Philippe, H. 1999. Where is the root of the universal

tree of life? Bioessays 21, 871–879.

Francino, M. P., and Ochman, H. 1997. Strand asymmetries in DNA

evolution, Trends Genet. 13, 240–245.

Frank, A. C., and Lobry, J. R. 1999. Asymmetric substitution

patterns: A review of possible underlying mutational or selective

mechanisms, Gene 238, 65–77.

Gold, L. 1988. Posttranscriptional regulatory mechanisms in

Escherichia coli, Annu. Rev. Biochem. 57, 199–233.

Gribaldo, S., and Philippe, H. 2002. Ancient phylogenetic

relationships, in "Evolution of Genome Structures" (A. M. Campbell and

S. Karlin, Eds.), Theor. Popul. Biol.

Gromiha, M. M., Oobatake, M., and Sarai, A. 1999. Important amino

acid properties for enhanced thermostability from mesophilic to

thermophilic proteins, Biophys. Chem. 82, 51–67.

Gupta, R. S. 1998. Protein phylogenies and signature sequences: A

reappraisal of evolutionary relationships among archaebacteria,

eubacteria, and eukaryotes, Microbiol. Mol. Biol. Rev. 62,

1435–1491.

Gupta, R. S. 2000. The natural evolutionary relationships among

prokaryotes, Crit. Rev. Microbiol. 26, 111–131.

Gupta, R. S., and Golding, G. B. 1996. The origin of the eukaryotic

cell, Trends Biochem. Sci. 21, 166–171.

Handy, J., and Doolittle, R. F. 1999. An attempt to pinpoint the

phylogenetic introduction of glutaminyl-tRNA synthetase among

bacteria, J. Mol. Evol. 49, 709–715.

Hartl, F. U., and Hayer-Hartl, M. 2002. Molecular chaperones in the

cytosol: from nascent chain to folded protein, Science 295,

1852–1858.

Hartman, H., and Fedorov, A. 2002. The origin of the eukaryotic cell:

A genomic investigation, Proc. Natl. Acad. Sci. USA 99,

1420–1425.

Hayes, J. M. 2000. Lipids as a common interest of microorganisms and

geochemists, Proc. Natl. Acad. Sci. USA 97, 14 033–14 034.

Houry, W. A., Frishman, D., Eckerskorn, C., Lottspeich, F., and

Hartl, F. U. 1999. Identification of in vivo substrates of the

chaperonin GroEL, Nature 402, 147–154.

Hunter, C. A. 1993. Sequence-dependent DNA structure. The role of

base stacking interactions, J. Mol. Biol. 230, 1025–1054.

Josse, J., Kaiser, A. D., and Kornberg, A. 1961. Enzymatic synthesis of

deoxyribonucleic acid. VIII. Frequencies of nearest neighbor base

sequences in deoxyribonucleic acid, J. Biol. Chem. 236, 864–875.


Karlin, S. 1998. Global dinucleotide signatures and analysis of

genomic heterogeneity, Curr. Opin. Microbiol. 1, 598–610.

Karlin, S. 2001. Detecting anomalous gene clusters and pathogenicity

islands in diverse bacterial genomes, Trends Microbiol. 9,

335–343.

Karlin, S., Brocchieri, L., Mrázek, J., Campbell, A. M., and

Spormann, A. M. 1999. A chimeric prokaryotic ancestry of

mitochondria and primitive eukaryotes, Proc. Natl. Acad. Sci.

USA 96, 9190–9195.

Karlin, S., and Burge, C. 1995. Dinucleotide relative abundance

extremes: A genomic signature, Trends Genet. 11, 283–290.

Karlin, S., Burge, C. and Campbell, A. M. 1992. Statistical analyses of

counts and distributions of restriction sites in DNA sequences,

Nucleic Acids Res. 20, 1363–1370.

Karlin, S., and Cardon, L. R. 1994. Computational DNA sequence

analysis, Annu. Rev. Microbiol. 48, 619–654.

Karlin, S., Doerfler, W., and Cardon, L. R. 1994. Why is CpG

suppressed in the genomes of virtually all small eukaryotic viruses

but not in those of large eukaryotic viruses? J. Virol. 68, 2889–2897.

Karlin, S., and Mrázek, J. 1996. What drives codon choices in human

genes? J. Mol. Biol. 262, 459–472.

Karlin, S., and Mrázek, J. 2000. Predicted highly expressed genes of

diverse prokaryotic genomes. J. Bacteriol. 182, 5238–5250.

Karlin, S., Mrázek, J., and Campbell, A. M. 1997. Compositional

biases of bacterial genomes and evolutionary implications,

J. Bacteriol. 179, 3899–3913.

Karlin, S., Mrázek, J., Campbell, A., and Kaiser, D. 2001.

Characterizations of highly expressed genes of four fast-growing

bacteria, J. Bacteriol. 183, 5025–5040.

Krieg, A. M. 1996. Lymphocyte activation by CpG dinucleotide motifs

in prokaryotic DNA, Trends Microbiol. 4, 73–76.

Krieg, A. M., Yi, A. K., Schorr, J., and Davis, H. L. 1998. The role of

CpG dinucleotides in DNA vaccines, Trends Microbiol. 6, 23–27.

Kuehn, M. J., Ogg, D. J., Kihlberg, J., Slonim, L. N., Flemmer, K.,

Bergfors, T., and Hultgren, S. J. 1993. Structural basis of pilus

subunit recognition by the PapD chaperone. Science 262,

1234–1241.

Kunkel, T. A. 1992. Biological asymmetries and the fidelity of

eukaryotic DNA replication, Bioessays 14, 303–308.

Leigh, J. A. 2000. Nitrogen fixation in methanogens: the archaeal

perspective, in “Prokaryotic Nitrogen Fixation: A Model System

for Analysis of a Biological Process” (E. W. Triplett, Ed.),

Horizon Scientific Press, Wymondham, UK.

Lipton, M. S., Paša-Tolić, L., Anderson, G. A., Anderson, D. J.,

Auberry, D. L., Battista, J. R., Daly, M. J., Fredrickson,

J., Hixson, K. K., Konstandarithes, H., Conrads, T. P., Masselon,

C., Markillie, L. M., Moore, R. J., Romine, M. F., Shen, Y., Tolić,

N., Udseth, H. R., Veenstra, T. D., Venkateswaran, A., Wong,

K.-K., Zhao, R., and Smith, R. D. 2002. Global analysis of

Deinococcus radiodurans R1 proteomes using accurate mass tags

(submitted).

Lobry, J. R. 1996a. Asymmetric substitution patterns in the two DNA

strands of bacteria, Mol. Biol. Evol. 13, 660–665.

Lobry, J. R. 1996b. Origin of replication of Mycoplasma genitalium,

Science 272, 745–746.

Lopez, P., Forterre, P., and Philippe, H. 1999. The root of the tree of

life in the light of the covarion model, J. Mol. Evol. 49, 496–508.

Lopez-Garcia, P., and Moreira, D. 1999. Metabolic symbiosis at the

origin of eukaryotes, Trends Biochem. Sci. 24, 88–93.

Ma, J., Campbell, A., and Karlin, S. 2002. Correlations between

Shine–Dalgarno sequences and predicted gene expression levels

and operon features, J. Bacteriol. (submitted).

Martin, W., and Müller, M. 1998. The hydrogen hypothesis for the

first eukaryote, Nature 392, 37–41.

Mrázek, J., and Karlin, S. 1998. Strand compositional asymmetry in

bacterial and large viral genomes, Proc. Natl. Acad. Sci. USA 95,

3720–3725.

Ochman, H., Lawrence, J. G., and Groisman, E. A. 2000. Lateral gene

transfer and the nature of bacterial innovation, Nature 405, 299–304.

Poole, A., Jeffares, D., and Penny, D. 1999. Early evolution:

prokaryotes, the new kids on the block, Bioessays 21, 880–889.

Powis, G., and Montfort, W. R. 2001. Properties and biological

activities of thioredoxins, Annu. Rev. Biophys. Biomol. Struct. 30,

421–455.

Richardson, J. S., and Richardson, D. C. 1988. Amino acid preferences

for specific locations at the ends of alpha helices, Science 240,

1648–1652.

Rivera, M. C., Jain, R., Moore, J. E., and Lake, J. A. 1998. Genomic

evidence for two functionally distinct gene classes, Proc. Natl.

Acad. Sci. USA 95, 6239–6244.

Ritz, D., and Beckwith, J. 2001. Roles of thiol-redox pathways in

bacteria, Annu. Rev. Microbiol. 55, 21–48.

Rocha, E. P. C., Danchin, A., and Viari, A. 1999. Universal replication

biases in bacteria, Mol. Microbiol. 32, 11–16.

Rocha, E. P. C., Danchin, A., and Viari, A. 2001. Evolutionary role of

restriction/modification systems as revealed by comparative

genome analysis, Genome Res. 11, 946–958.

Russell, G. J., McGeoch, D. J., Elton, R. A., and Subak-Sharpe, J. H.

1976. Doublet frequency analysis of bacterial DNAs, J. Mol. Evol.

2, 277–292.

Russell, G. J., and Subak-Sharpe, J. H. 1977. Similarity of the

general designs of protochordates and invertebrates, Nature 266,

533–536.

Sandler, S. J., Satin, L. H., Samra, H. S., and Clark, A. J. 1996. recA-

like genes from three archaean species with putative protein

products similar to Rad51 and Dmc1 proteins of the yeast

Saccharomyces cerevisiae, Nucleic Acids Res. 24, 2125–2132.

Sengupta, J., Agrawal, R. K., and Frank, J. 2001. Visualization of

protein S1 within the 30S ribosomal subunit and its interaction

with messenger RNA, Proc. Natl. Acad. Sci. USA 98,

11,991–11,996.

Siegert, R., Leroux, M. R., Scheufler, C., Hartl, F. U., and Moarefi, I.

2000. Structure of the molecular chaperone prefoldin: Unique

interaction of multiple coiled coil tentacles with unfolded proteins,

Cell 103, 621–632.

Suyama, M., and Bork, P. 2001. Evolution of prokaryotic gene order:

Genome rearrangements in closely related species, Trends Genet.

17, 10–13.

Tatusov, R. L., Natale, D. A., Garkavtsev, I. V., Tatusova, T. A.,

Shankavaram, U. T., Rao, B. S., Kiryutin, B., Galperin, M. Y.,

Fedorova, N. D., and Koonin, E. V. 2001. The COG database: new

developments in phylogenetic classification of proteins from

complete genomes, Nucleic Acids Res. 29, 22–28.

Teter, S. A., Houry, W. A., Ang, D., Tradler, T., Rockabrand, D.,

Fischer, G., Blum, P., Georgopoulos, C., and Hartl, F. U.

1999. Polypeptide flux through bacterial Hsp70: DnaK cooperates

with trigger factor in chaperoning nascent chains, Cell

97, 755–765.

Thomas, D. C., Svoboda, D. L., Vos, J. M. H., and Kunkel, T. A.

1996. Strand specificity of mutagenic bypass replication of DNA

containing psoralen monoadducts in a human cell extract, Mol.

Cell. Biol. 16, 2537–2544.

VanBogelen, R. A., Abshire, K. Z., Pertsemlidis, A., Clark, R. L., and

Neidhardt, F. C. 1996. Gene-protein database of Escherichia coli

K-12, edition 6, in “Escherichia coli and Salmonella: Cellular and

Molecular Biology,” 2nd ed. (F. C. Neidhardt, R. Curtiss III, J. L.

Ingraham, E. C. C. Lin, K. B. Low, B. Magasanik, W. S.

Reznikoff, M. Riley, M. Schaechter and H. E. Umbarger, Eds.),

pp. 2067–2117, ASM Press, Washington, D.C.

VanBogelen, R. A., Schiller, E. E., Thomas, J. D., and Neidhardt, F.

C. 1999. Diagnosis of cellular states of microbial organisms using

proteomics, Electrophoresis 20, 2149–2159.

Warner, J. R. 1999. The economics of ribosome biosynthesis in yeast,

Trends Biochem. Sci. 24, 437–440.

Winkler, H. H., and Daugherty, R. M. 1986. Acquisition of glucose by

Rickettsia prowazekii through the nucleotide intermediate uridine

5′-diphosphoglucose, J. Bacteriol. 167, 805–808.

Woese, C. R., Kandler, O., and Wheelis, M. L. 1990. Towards a natural

system of organisms: Proposal for the domains Archaea, Bacteria,

and Eucarya, Proc. Natl. Acad. Sci. USA 87, 4576–4579.

Woese, C. 1998. The universal ancestor, Proc. Natl. Acad. Sci. USA 95,

6854–6859.

Wolf, Y. I., Aravind, L., Grishin, N. V., and Koonin, E. V. 1999.

Evolution of aminoacyl-tRNA synthetases: analysis of unique

domain architectures and phylogenetic trees reveals a complex

history of horizontal gene transfer events, Genome Res. 9, 689–710.

Zillig, W., Klenk, H. P., Palm, P., Pühler, G., Gropp, F., Garrett, R.

A., and Leffers, H. 1989. The phylogenetic relations of DNA-

dependent RNA polymerases of archaebacteria, eukaryotes, and

eubacteria, Can. J. Microbiol. 35, 73–80.