uc davis eve161 lecture 11 by @phylogenomics
TRANSCRIPT
Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Lecture 10:
EVE 161: Microbial Phylogenomics
Lecture #10:
Era III: Genome Sequencing
UC Davis, Winter 2014 Instructor: Jonathan Eisen
Where we are going and where we have been
• Previous lecture: 10: Genome Sequencing
• Current Lecture: 11: Genome Sequencing II
• Next Lecture: 12: Genome Sequencing III
Comparative Genomics
Structural Diversity
Structural Diversity
In many organisms, there is a clear distinction in size between the chromosomes and the plasmids. However, as more complete genome sequences are determined, size is no longer an infallible criterion for distinguishing different types of genetic elements. For example, the halophilic archaeon Haloferax volcanii has five circular DNA elements with sizes of 2.92 Mb, 690 kb, 442 kb, 86 kb, and 6.4 kb (other examples are in Table 7.1). If size were the only criterion, we might consider the 690-kb element in H. volcanii to be a second chromosome because it is larger than the chromosome of B. aphidicola APS, for example. However, size is just a property that helps distinguish plasmids from chromosomes. More importantly, there are significant biological differences between plasmids and chromosomes. Some of these differences are discussed in the following sections.
Plasmids, unlike chromosomes, are generally “accessory” elements, carrying genes that are required only under certain conditions (Table 7.2). For example, the B. aphidicola APS plasmids encode genes needed to synthesize tryptophan and leucine, two of the amino acids that the bacteria provide for their host. The B. aphidicola APS chromosome encodes all the information for DNA replication, transcription, translation, cell-membrane and cell-wall formation, and the other genes required to assemble the core machinery of the cell. In E. coli O157:H7, the 92-kb plasmid encodes many virulence factors that contribute to the disease caused by this bacterium, whereas the chromosome encodes all the housekeeping functions. Because plasmids typically have only
170 Part II • THE ORIGIN AND DIVERSIFICATION OF LIFE
TABLE 7.1. Examples of bacteria with multiple genetic elements
Species                      Form               Size (kb)   Shape
Streptomyces coelicolor      Chromosome         8667        Linear
                             Plasmid            356         Linear
                             Plasmid            31          Circular
Agrobacterium tumefaciens    Chromosome         2842        Circular
                             Chromosome         2057        Linear
                             Plasmid            543         Circular
                             Plasmid            214         Circular
Borrelia burgdorferi         Chromosome         911         Linear
                             Plasmid (n = 11)   9–54        Circular/Linear
Brucella melitensis          Chromosome         2117        Circular
                             Chromosome         1178        Circular
Clostridium acetobutylicum   Chromosome         3941        Circular
                             Plasmid            192         Circular
Deinococcus radiodurans      Chromosome         2649        Circular
                             Plasmid            412         Circular
                             Plasmid            177         Circular
                             Plasmid            46          Circular
Ralstonia solanacearum       Chromosome         3716        Circular
                             Chromosome?        2095        Circular
Salmonella typhi             Chromosome         4809        Circular
                             Plasmid            218         Circular
                             Plasmid            107         Circular
Sinorhizobium meliloti       Chromosome         3654        Circular
                             Plasmid            1683        Circular
                             Plasmid            1354        Circular
Vibrio cholerae              Chromosome         2941        Circular
                             Chromosome         1072        Circular
Yersinia pestis              Chromosome         4654        Circular
                             Plasmid (n = 3)    10–96       Circular
Based on Bentley S.D. and Parkhill J. Annu. Rev. Genet. 38: 771–792, as adapted from Ohmachi M. 2002. Curr. Biol. 12: R427–428.
What is a Plasmid
Chapter 7 • BACTERIAL AND ARCHAEAL GENETICS AND GENOMICS 171
accessory functions, an organism can usually survive without them provided it is not exposed to the specialized conditions for which the plasmids are needed. In turn, this means that plasmids are commonly lost from particular bacterial and archaeal strains.
In most species, only one copy of the chromosome (or, at most, a few) is present per cell. Plasmids, however, are frequently present in much greater copy number; sometimes there are hundreds of copies per cell. Allowing the copy number of plasmids to increase (while controlling chromosome copy number) in essence means that all of the genes on the plasmids undergo substantial gene duplication. For example, in B. aphidicola APS, the ratio of tryptophan and leucine plasmid number to chromosome number is greater than 10:1. This difference in copy number arises because plasmid and chromosomal replication are not coupled. Furthermore, the two frequently use entirely separate replication mechanisms. In addition, because plasmids and chromosomes use different replication systems, they frequently have different rates and patterns of mutation.
From an evolutionary point of view, the most important distinction between plasmids and chromosomes is the ease with which plasmids move between strains and even species. The mobility of plasmids plays a critical role in lateral gene transfer (see below). This transfer of plasmids results in very sporadic plasmid distribution patterns when different strains of one species or different species are compared.
In almost all species, there is only one chromosome and all other genetic elements are plasmids. There are, however, a few notable exceptions of bacteria with more than one chromosome. The causative agent of cholera, Vibrio cholerae, has two large genetic elements (2.9 and 1.1 Mb in size; see Table 7.1). Both encode multiple housekeeping genes and are found in all close relatives of this species (and thus do not have sporadic distribution patterns). Therefore, both elements qualify as chromosomes.
Agrobacterium tumefaciens, which causes crown gall tumors in plants, has an unusual pair of chromosomes: One is circular (as is typical for bacteria), but the other is linear. Once thought to be the exclusive province of eukaryotes, linear genetic elements have now been found in several species of bacteria.
Linear chromosomes are faced with a unique problem: DNA polymerases cannot replicate the ends of the chromosome, because the enzymes cannot replace the terminal RNA primer of the lagging strand (see Box 12.1). Without another mechanism for replicating the ends (i.e., the telomeres), linear chromosomes would become shorter with each round of replication. Eukaryotes use a specialized enzyme, telomerase, which adds a repeating DNA motif to the telomeres (see Fig. 8.17). Bacteria, like A. tumefaciens, appear
TABLE 7.2. Plasmid functions
Genetic Function of Plasmid   Gene Functions                                  Examples
Resistance                    Antibiotic resistance                           Rbk plasmid of Escherichia coli and other bacteria
Fertility                     Conjugation and DNA transfer                    F plasmid of E. coli
Killer                        Synthesis of toxins that kill other bacteria    Col plasmids of E. coli, for colicin production
Degradative                   Enzymes for metabolism of unusual molecules     TOL plasmid of Pseudomonas putida, for toluene metabolism
Virulence                     Pathogenicity                                   Ti plasmid of Agrobacterium tumefaciens, conferring the ability to cause crown gall disease on dicotyledonous plants
Genome Size
Genome Size
to use a similar mechanism for preserving the ends of their linear chromosomes. Furthermore, it appears that these replication systems arose independently in bacteria and eukaryotes and, thus, are an interesting example of convergent evolution.
Bacterial and Archaeal Genomes Are Much Smaller and More Compact Than Those of Eukaryotes
Bacterial and archaeal genomes are smaller than the vast majority of eukaryotic genomes (Fig. 7.1). Among bacteria, genomes range in size from 160 kb (the obligate symbiont Carsonella ruddii) to more than 13 Mb (the δ-proteobacterium Sorangium cellulosum). Archaeal genomes range from 490 kb (Nanoarchaeum equitans, a symbiotic species [Fig. 6.7]) to 5.7 Mb (the methanogen Methanosarcina acetivorans). The median genome size for both archaea and bacteria is approximately 2 Mb.
When comparing bacteria and archaea with eukaryotes, the difference in genome size is much greater than the difference in the number of genes. This is because the density of genes is very great within bacterial and archaeal genomes (Fig. 7.2). For example, the human genome is approximately 1000 times bigger than the E. coli K12 genome, yet humans have only about ten times as many protein-coding genes (Fig. 7.3). In fact, a number of bacteria and archaea have more protein-coding genes than some eukaryotes. Almost all species of Myxobacteria (a subgroup of fruiting-body-forming δ-proteobacteria, including S. cellulosum with the 13-Mb genome) have greater than 8000 protein-coding genes, which is more than in the model yeast species Saccharomyces cerevisiae and Schizosaccharomyces pombe.
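The human versus E. coli comparison can be turned into a quick back-of-the-envelope calculation. The sketch below uses only the two approximate ratios quoted in the text (about 1000-fold for genome size, about tenfold for gene count); the constant and function names are illustrative and do not come from the source.

```python
# Approximate ratios quoted in the text (human relative to E. coli K12).
GENOME_SIZE_RATIO = 1000   # human genome is roughly 1000x larger
GENE_COUNT_RATIO = 10      # humans have roughly 10x as many protein-coding genes


def relative_gene_density(genome_size_ratio: float, gene_count_ratio: float) -> float:
    """Gene density (genes per base pair) of genome A relative to genome B,
    given A/B ratios for genome size and gene count."""
    return gene_count_ratio / genome_size_ratio


human_vs_ecoli = relative_gene_density(GENOME_SIZE_RATIO, GENE_COUNT_RATIO)
ecoli_vs_human = 1 / human_vs_ecoli

print(f"E. coli carries roughly {ecoli_vs_human:.0f}x more genes per base pair")
```

With these rounded inputs the density difference comes out at about two orders of magnitude, which is the point the paragraph makes qualitatively.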
The great density of genes within the genomes of bacteria and archaea is due to the paucity of noncoding DNA compared with that in eukaryotic genomes. Introns and intergenic regions (i.e., the DNA located between genes) are rare and generally small in bacteria and archaea. Instead, as mentioned in Chapter 6, many bacterial and archaeal genes are organized into operons, clusters of cotranscribed genes that use only a single promoter for the entire gene cluster. This organization helps to create a compact genome. The genes found in a single operon are usually involved in similar functions (e.g., the same metabolic pathway or a single-protein complex) (Fig. 7.4). Operons are a critical feature of the genomes of bacteria and archaea. For example, it is estimated that E. coli K12 has about 700 operons in its genome.
Eukaryotic genomes are bulky in part because they contain large numbers of repetitive DNA elements (Fig. 7.2). Common eukaryotic repetitive DNA elements include sim-
FIGURE 7.1. Genome sizes in the three domains of life. A selection of genome sizes and size ranges from specific groups of organisms is indicated.
Axis: number of base pairs, 1 × 10^5 to 1 × 10^13. Domains: Bacteria, Archaea, Eukaryotes.
Labeled examples: Arabidopsis thaliana, Leishmania major, Guillardia theta, P. marius, Nanoarchaeum equitans, Methanosarcina acetivorans, Myxobacteria, Bradyrhizobium japonicum, Escherichia coli, human, fern, cockroach, moss, Amoeba dubia, Schizosaccharomyces pombe, Paramecium tetraurelia.
Gene Density
Gene Density
ple sequence repeats (e.g., microsatellites and minisatellites), gene duplications (both tandem arrays and pseudogenes), and transposable elements. Although bacterial and archaeal genomes contain repetitive DNA, the total amount is relatively small. For example, hundreds of thousands of copies of transposable elements are present in many eukaryotic genomes, yet in bacteria and archaea it is rare to have even 100 copies.
Pressure to Streamline Genomes Causes Bacteria and Archaea to Lose Genes Not Actively Maintained by Selection
To understand the evolution of bacterial and archaeal genomes, it is useful to ask why there is so much more noncoding DNA in most eukaryotic genomes. Clearly, some of the extra DNA in eukaryotes has important functions such as gene regulation. However, much of the noncoding DNA in eukaryotic genomes has been classified as either junk DNA or selfish DNA. Junk DNA appears to provide little benefit or no function to the organism. (In some cases this designation is a misnomer resulting from a lack of infor-
FIGURE 7.2. Genome density. Comparison of the genome density and content of humans and Escherichia coli. Each segment is 50 kb in length and represents (A) a portion of the human β T-cell receptor locus and (B) a region of the E. coli K12 genome. Note the much greater proportion of genes (red boxes) in E. coli compared to humans.
Key: gene; human pseudogene; repetitive DNA element. Scale: 0 to 50 kb.
FIGURE 7.3. Genome size vs. number of protein-coding genes. The number of genes is highly correlated to genome size for bacteria, archaea, and viruses, but less so for eukaryotes. Many archaeal points (blue triangles) are hidden under bacterial ones (yellow squares).
Axes: genome size (10^5 to 10^10) vs. genes (0 to 30,000). Series: Bacteria, Archaea, Eukaryotes, Viruses.
Number of genes
Number of Genes
Gene Arrangement
Operons
FIGURE 7.4. Lac operon from Escherichia coli. This operon consists of three genes whose transcription is regulated by a single promoter. The genes encode proteins involved in utilizing lactose, including a permease (encoded by lacY), which brings lactose into the cell from the outside, and two enzymes (encoded by lacZ and lacA), which split lactose into glucose + galactose (see pp. 52–53).
Diagram labels: CAP site, promoter, operator, lacZ (β-galactosidase), lacY (lactose permease, transports lactose into the cell), lacA (transacetylase); reaction shown: lactose to galactose + glucose.
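One way to picture the operon described in Figure 7.4 is as a small record: a single promoter plus an ordered list of genes that are transcribed together. The sketch below is only an illustration of that idea; the class, its fields, and the promoter label "Plac" are hypothetical names, not taken from the source.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Operon:
    name: str
    promoter: str             # the single promoter shared by every gene
    genes: Tuple[str, ...]    # genes in their order along the DNA

    def transcript(self) -> str:
        """Genes carried on the one cotranscribed message, in order."""
        return "-".join(self.genes)


# "Plac" is a placeholder label for the promoter, not a name from the text.
lac = Operon(name="lac", promoter="Plac", genes=("lacZ", "lacY", "lacA"))
print(lac.transcript())  # lacZ-lacY-lacA
```

Because the whole cluster hangs off one promoter, moving this one record between genomes moves the complete pathway, which is the property the later discussion of lateral transfer relies on.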
mation. Some stretches of “junk DNA” have been determined to be involved in gene regulation, chromatin organization, centromere activity, and other functions.) Selfish DNA is composed of mobile DNA elements that facilitate their own duplication, even if it is to the detriment of the host.
All of the many theories that have been proposed to explain why junk DNA and selfish DNA are less abundant in bacteria and archaea agree that there is some global pressure to keep total genome size small. This global pressure is most likely selection, although there may also be a bias toward deletion of DNA. Indeed, such a mechanism in bacteria and archaea could be responsible for keeping introns both small and rare, holding transposable elements in check, maintaining operons, and culling junk DNA. This global pressure and other theories on the evolution of genome size are discussed in more detail in Chapter 21. Here we discuss its effects on the general patterns of genomic evolution in bacteria and archaea.
The limited occurrence of introns in bacteria and archaea has many important consequences. For example, although eukaryotes can make thousands of protein products from a single gene by alternative splicing, this is not seen in bacteria and archaea. In addition, mixing and matching of protein domains is less common in bacteria and archaea than in eukaryotes, possibly because such events are caused mainly by recombination in introns.
The extensive use of operons also has significant consequences. In some respects, operons are a major constraint; mutations that break up the operon (e.g., by causing a rearrangement in the middle of the operon) may be quite detrimental. In other ways, operons facilitate rapid acquisition of new features by bacteria and archaea because they allow complete pathways to be transferred readily between strains or species (see the discussion of lateral transfer later in this chapter). In contrast, in many eukaryotes, with genes involved in the same pathway scattered around the genome, it is unlikely that all of the genes would be transferred at one time to another strain or species.
In bacteria and archaea, the pressure to streamline genomes (whether caused by mutation bias or selection for small genomes or both) means that genes that provide no advantage are rapidly lost (see Box 18.2). Thus, although vestigial genes may linger for long periods in eukaryotes, they do not linger in bacteria and archaea. For exam-
Gene Content
Shared Genes
E. coli shared Genes
substantial variation in gene content among members of the same species have been reported in other lineages of bacteria and archaea. Thus, the diminishing number of core orthologous genes is simply an extension of something happening among close relatives.
How do such extensive differences in gene content among close relatives originate? One of the most important clues comes from comparing the genome structures of related species. (A graphical method for aligning circular genomes is introduced in Box 7.1; see Figs. 7.8 and 7.9.) In comparing E. coli K12 and O157:H7, the genes that are shared between the two strains not only are highly conserved at the sequence level; but
Graphical Alignment for Comparing Circular Genomes Using Dotplots
Comparing the arrangement of genomes is a critical tool for understanding how they evolve. This enables scientists to identify and characterize genome rearrangements (e.g., inversions and translocations) and to search for patterns and associations that may explain how and why certain events occur. For example, differences in gene order between species are frequently at sites where repetitive DNA is found, which suggests that recombination at the repetitive DNA may have led to rearrangements. One of the more useful methods compares two genomes on an x–y plot, a procedure commonly referred to as a dotplot.
Dotplots let people use their visual pattern-recognition skills to identify similarities. Their power and simplicity have made them a valuable analytical tool in fields beyond biology, including electrical engineering and computer science. Let us illustrate the method using some text-based examples. Figure 7.8A plots a familiar quotation against itself. The central diagonal line is the axis of identity. The outlying points represent text that repeats. A quick examination can distinguish a pattern that is repeated in its entirety (Fig. 7.8B) from one with some unique elements (Fig. 7.8C).
Because most bacterial and archaeal chromosomes are circular, a chromosome must first be “opened” before laying it out on the x- or y-axis. Although the circle can be linearized at any point, it is preferable to open each chromosome at its origin of replication (Fig. 7.9A). One linearized chromosome is then aligned along the x-axis with the origin of replication placed at the graphical origin. The other chromosome is similarly arranged along the y-axis. The two chromosomes are com-
FIGURE 7.7. Number of shared proteins between strains of Escherichia coli. Note the large number of genes found in one strain but not the others (seen in the outer portions of each circle).
Strains compared: MG1655 (K-12), nonpathogenic; EDL933 (O157:H7), enterohemorrhagic; CFT073, uropathogenic.
Region counts shown in the diagram: 585, 514, 204, 193, 2996, 1346, 1623.
FIGURE 7.8. Dotplots of repeating text. Each panel plots a text against itself: (A) “to be or not to be”; (B) “a b c d e a b c d e”; (C) “a b c d e a b c y z d e”.
Box 7.1
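The text-based dotplot of Figure 7.8A is easy to reproduce. The following is a minimal sketch, assuming words are compared for exact equality; the function names `dotplot` and `render` are illustrative and not part of the source.

```python
def dotplot(x_tokens, y_tokens):
    """Return a grid with True wherever the x token equals the y token."""
    return [[x == y for x in x_tokens] for y in y_tokens]


def render(grid):
    """Draw the grid as text: '*' for a match, '.' for no match."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)


words = "to be or not to be".split()
grid = dotplot(words, words)
print(render(grid))
```

In this rendering the first row is printed at the top, so the axis of identity runs from the top-left to the bottom-right corner. The off-diagonal marks come from the repeated words "to" and "be", which is the pattern the box describes for repeating text.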
Gene Order
Gene Order
Origin of replication
Terminus of replication
Origin of replication
Terminus of replication
Artificially Open Circle
Origin Terminus Origin Again
Origin of replication
Terminus of replication
Artificially Open Circle
Origin Terminus Origin Again
Genome 1 (x-axis): O T O
Genome 2 (y-axis): O T O
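Opening a circular chromosome at its origin of replication amounts to shifting every coordinate so that the origin sits at zero and wrapping around the end of the circle. The sketch below assumes zero-based integer coordinates; the function name and the toy genome values are hypothetical.

```python
def linearize(position: int, origin: int, genome_length: int) -> int:
    """Coordinate of `position` after opening the circle at `origin`."""
    return (position - origin) % genome_length


# Toy circular genome (all values are made up for illustration).
GENOME_LENGTH = 1000
ORIGIN = 900
TERMINUS = 400

print(linearize(ORIGIN, ORIGIN, GENOME_LENGTH))    # 0: origin at the start of the axis
print(linearize(TERMINUS, ORIGIN, GENOME_LENGTH))  # 500: terminus near the middle
print(linearize(50, ORIGIN, GENOME_LENGTH))        # 150: wraps past the old end
```

After this shift each axis reads origin, terminus, origin again, matching the "O T O" layout on the slide.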
FIGURE 7.10. Conserved gene order in the backbone of Escherichia coli K12 and O157:H7. The two genomes were aligned with each other and the matching regions were plotted. The conserved order of genes in the backbone of the two E. coli strains is indicated by the diagonal line. Three important genomic regions are circled. An island present in one of the two strains causes a slight shift in the position of the main diagonal.
Axes: E. coli K12 and E. coli O157:H7. Circled regions: island, inversion, repeat.
they also occur in virtually the same order in both strains (Fig. 7.10). The genes unique to each strain are clustered into “islands” interspersed among the stretches of common genes. Similar patterns of DNA “islands” within a conserved genome backbone have been found among other related bacteria or archaea.
How do these islands originate? There are two possibilities: insertion of DNA into the strain with the island or deletion of DNA in the strain without the island. Gene loss is very common and frequently very rapid in bacteria and archaea (e.g., Fig. 7.5). However, relying on gene loss alone to explain genomic islands becomes untenable as more and more species are compared. For example, when the genome of a third strain of E. coli was determined, it was found to have many additional islands that are absent from both K12 and O157:H7 (Fig. 7.7). For gene loss to explain all the islands in the various E. coli strains, their common ancestor would have required an enormous genome from which different regions were lost in different lineages. Indeed, such a mechanism would require ancestral species to have had bigger and bigger genomes further back in time. Thus genes must be acquired to offset gene loss. Acquisition of genes is one of the hallmarks of bacterial and archaeal evolution and is discussed on pages 182–191.
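The loss-only argument can be made concrete with the counts that appear in Figure 7.7. The sketch below is a rough illustration, assuming that 2996 is the central region shared by all three strains and that the other six numbers are the strain-specific and pairwise regions; the function names are hypothetical.

```python
# Region counts as they appear in the Figure 7.7 diagram.
# Assumption: 2996 is the region shared by all three E. coli strains.
SHARED_BY_ALL = 2996
OTHER_REGIONS = [585, 514, 204, 193, 1346, 1623]  # strain-specific and pairwise regions


def loss_only_ancestor_size(core: int, others: list) -> int:
    """Proteins the common ancestor would need if every difference were due to loss:
    the union of everything seen in any strain."""
    return core + sum(others)


def core_fraction(core: int, others: list) -> float:
    """Share of that union that is actually present in all strains."""
    return core / loss_only_ancestor_size(core, others)


print(loss_only_ancestor_size(SHARED_BY_ALL, OTHER_REGIONS))    # 7461
print(round(core_fraction(SHARED_BY_ALL, OTHER_REGIONS), 3))    # 0.402
```

Under that assumption a loss-only ancestor would need far more proteins than any one of the three strains carries, and each additional strain with its own islands would push the required total higher still.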
Gene Order Changes Rapidly but with Strong Constraints
In addition to studying the location of genes found in one organism but not another, it is useful to compare the order of genes and other genomic features that are conserved between species. These comparisons reveal how genomes evolve and what the constraints are on the relative positioning of genes.
As with gene content, there is little conservation in the gene order between distantly related species (Fig. 7.11). Some sets of genes, however, are strongly conserved. The best example is the genes that encode many of the ribosomal proteins (Fig. 7.12). When such conservation occurs across such large evolutionary distances, it suggests that tightly coordinated regulation of transcription and translation is necessary for functionality. This is probably due in part to the coupling of transcription and translation in bacteria and archaea. In turn, the lack of coupling in eukaryotes may explain why there are few examples of gene-order conservation across such large distances.
When gene-order comparisons are made among closely related strains or species, rearrangements are frequently observed at sites of repetitive sequences such as transposons or duplicated genes (Fig. 7.13). Although repetitive DNA is less abundant in bacteria and archaea, it still plays a major role in genome evolution.
Comparing gene order among multiple sets of close relatives has revealed what types of rearrangements are most common. In bacteria and archaea, one of the most com-
mon is symmetric inversion around the origin of replication (Fig. 7.14). Such inversions are seen in almost every comparison of moderately closely related strains or species. Although other rearrangements occur, the symmetric inversions serve as a useful tool for understanding some features of general evolution and we focus on them here.
Symmetric inversions around the origin are due to a combination of mutation bias and selection bias. To understand how mutation bias could cause this, it is helpful to understand some of the features of circular chromosome replication in bacteria and archaea. Replication of circular chromosomes almost always begins at a single region, referred to as the origin of replication. DNA replication proceeds bidirectionally from this origin, continuing until the replication forks collide on the other side of the DNA circle at the terminus of replication (Fig. 7.15). It is thought that the replication complex stands relatively still and the DNA is threaded through this complex, which would place the two replication forks close to each other. This threading can thus lead to symmetric inversions. If the DNA replication complexes were to slip and drop the DNA strands, they might restart replication by using the template from the opposite side of the origin to extend the recently replicated DNA, thereby causing an inversion. As the two replication forks should
FIGURE 7.11. The lack of conservation of gene order between Haemophilus influenzae and Helicobacter pylori is illustrated. Linearized chromosomes of H. influenzae and H. pylori are plotted on the horizontal and vertical axes, respectively. Each dot represents a single pair of orthologous proteins. Genes in similar operons, which do exist, are too close together to give separated points on the scale used.
Axes: H. influenzae Rd chromosome (0 to 1,830,137); H. pylori 26695 chromosome (0 to 1,667,867).
FIGURE 7.12. Conservation of gene order of ribosomal protein operons across bacterial and archaeal species.
Species compared: Sinorhizobium meliloti, Bacillus subtilis, Borrelia burgdorferi, Treponema pallidum, Helicobacter pylori, Escherichia coli, Haemophilus influenzae, Rickettsia prowazekii, Mycoplasma sp., Aquifex aeolicus, Thermotoga maritima, Deinococcus radiodurans, Mycobacterium tuberculosis, Chlamydia sp., Synechocystis, Archaea.
Operon regions: rpoBC, str, S10, spc, alpha.
Legend: small-subunit r-protein genes, large-subunit r-protein genes, nonribosomal genes, unknown genes, breakpoint, gene insertion, Rho-independent terminator, missing gene.
Gene order shown: L11 (rplK), L1 (rplA), L10 (rplJ), L7/L12 (rplL), rpoB, rpoC, unknown, S12 (rpsL), S7 (rpsG), fusA, tufA, S10 (rpsJ), L3 (rplC), L4 (rplD), L23 (rplW), L2 (rplB), S19 (rpsS), L22 (rplY), S3 (rpsC), L16 (rplP), L29 (rpmC), S17 (rpsQ), L14 (rplN), L24 (rplX), L5 (rplE), S14 (rpsN), S8 (rpsH), L6 (rplF), L18 (rplR), S5 (rpsE), L30 (rpmD), L15 (rplO), secY, adk, map, infA, L36 (rpmJ), S13 (rpsM), S11 (rpsK), S4 (rpsD), rpoA, L17 (rplQ).
169-194_Evo_Ch07.qxd:13937_C05.qxd 12/15/08 11:05 AM Page 179
Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Chapter 7 • BACTERIAL AND ARCHAEAL GENETICS AND GENOMICS 179
mon is symmetric inversion around the origin of replication (Fig. 7.14). Such inversionsare seen in almost every comparison of moderately closely related strains or species. Al-though other rearrangements occur, the symmetric inversions serve as a useful tool forunderstanding some features of general evolution and we focus on them here.
Symmetric inversions around the origin are due to a combination of mutation biasand selection bias. To understand how mutation bias could cause this, it is helpful to un-derstand some of the features of circular chromosome replication in bacteria and archaea.Replication of circular chromosomes almost always begins at a single region—referred toas the origin of replication. DNA replication proceeds bidirectionally from this origin, con-tinuing until the replication forks collide on the other side of the DNA circle at the ter-minus of replication (Fig. 7.15). It is thought that the replication complex stands relativelystill and the DNA is threaded through this complex, which would place the two replica-tion forks close to each other. This threading can thus lead to symmetric inversions. If theDNA replication complexes were to slip and drop the DNA strands, they might restartreplication by using the template from the opposite side of the origin to extend the re-cently replicated DNA, thereby causing an inversion. As the two replication forks should
400,0000400,000
0
800,0001,200,0001,600,0001,667,867
800,000H. influenzae Rd chromosome
H.pylori266
95chromoso
me
1,200,000 1,830,137
FIGURE 7.11. The lack of conservation ofgene order between Haemophilus influen-zae and Helicobacter pylori is illustrated.Linearized chromosomes of H. influenzaeand H. pylori are plotted on the horizontaland vertical axes, respectively. Each dot rep-resents a single pair of orthologous proteins.Genes in similar operons, which do exist,are too close together to give separatedpoints on the scale used.
[Figure 7.12: aligned maps of the ribosomal protein gene clusters (rpoBC, str, S10, spc, and alpha operons). Rows: Sinorhizobium meliloti, Bacillus subtilis, Borrelia burgdorferi, Treponema pallidum, Helicobacter pylori, Escherichia coli, Haemophilus influenzae, Rickettsia prowazekii, Mycoplasma sp., Aquifex aeolicus, Thermotoga maritima, Deinococcus radiodurans, Mycobacterium tuberculosis, Chlamydia sp., Synechocystis, and Archaea. Gene order shown: L11 (rplK), L1 (rplA), L10 (rplJ), L7/L12 (rplL), rpoB, rpoC, unknown, S12 (rpsL), S7 (rpsG), fusA, tufA, S10 (rpsJ), L3 (rplC), L4 (rplD), L23 (rplW), L2 (rplB), S19 (rpsS), L22 (rplY), S3 (rpsC), L16 (rplP), L29 (rpmC), S17 (rpsQ), L14 (rplN), L24 (rplX), L5 (rplE), S14 (rpsN), S8 (rpsH), L6 (rplF), L18 (rplR), S5 (rpsE), L30 (rpmD), L15 (rplO), secY, adk, map, infA, L36 (rpmJ), S13 (rpsM), S11 (rpsK), S4 (rpsD), rpoA, L17 (rplQ). Legend: small-subunit r-protein genes, large-subunit r-protein genes, nonribosomal genes, unknown genes, breakpoint, gene insertion, Rho-independent terminator, missing gene.]
FIGURE 7.12. Conservation of gene order of ribosomal protein operons across bacterial and archaeal species.
Gene Order Again
V. cholerae vs. E. coli All
[Dot plot. x-axis: V. cholerae coordinates (0 to 3,000,000). y-axis: E. coli coordinates (0 to 5,000,000).]
Eisen et al., 2000
V. cholerae vs. E. coli Best
[Dot plot. x-axis: V. cholerae coordinates (0 to 3,000,000). y-axis: E. coli coordinates (0 to 5,000,000).]
Eisen et al., 2000
V. cholerae vs. E. coli, Rotated
[Dot plot. x-axis: V. cholerae ORF coordinates (0 to 3,000,000). y-axis: E. coli ORF coordinates (0 to 5,000,000).]
Eisen et al., 2000
Duplication and Gene Loss Model
Eisen et al., 2000
V. cholerae vs. E. coli: Orthologs on Both Diagonals
[Dot plot. x-axis: V. cholerae ORF coordinates (0 to 3,000,000). y-axis: E. coli ORF coordinates (0 to 5,000,000).]
Eisen et al., 2000
C. trachomatis vs C. pneumoniae
[Dot plot. x-axis: C. trachomatis MoPn. y-axis: C. pneumoniae AR39. The origin and terminus are marked.]
Symmetric Inversion Model
[Diagram: circular chromosomes carrying hypothetical genes 1 to 32, shown for the common ancestor of A and B and for lineages A and B at three time points (A1, A2, A3 and B1, B2, B3). Each lineage undergoes an inversion around the terminus and then an inversion around the origin; inversions occur between the asterisks (*). Scatterplots compare the gene orders of the different stages.]
Eisen et al., 2000
The X-Files
[Dot plot panels: Streps; B. subt vs. Staph; M. tb vs. M. leprae (x-axis: Mycobacterium leprae, y-axis: Mycobacterium tuberculosis); Pyrococcus; Thermoplasmas; Pseudomonas.]
[Figure 7.14, panels A to C. (A) Circular chromosomes carrying hypothetical genes 1 to 32 for the common ancestor of A and B and for each lineage at time points 1 to 3, with inversions around the terminus (*) and around the origin (*), plus scatterplots of A1 vs. A2, A2 vs. A3, B1 vs. B2, B2 vs. B3, and the between-species comparisons. (B) Dot plot of V. cholerae chromosome I (x-axis) vs. V. parahaemolyticus chromosome I (y-axis). (C) Dot plot of V. cholerae chromosome I (x-axis) vs. Escherichia coli (y-axis).]
FIGURE 7.14. X-alignments. (A) Schematic model of symmetric genome inversions. The model shows an initial speciation event, followed by a series of inversions in the different lineages (A and B). Inversions occur between the asterisks (*). Numbers on the chromosome refer to hypothetical genes 1–32. At time point 1, the genomes of the two species are still colinear (as indicated in the scatterplot of A1 vs. B1). Between time point 1 and time point 2, each species (A and B) undergoes a large inversion about the terminus (as indicated in the scatterplots of A1 vs. A2 and B1 vs. B2). This results in the between-species scatterplot looking as if there have been two nested inversions (A2 vs. B2). Between time point 2 and time point 3, each species undergoes an additional inversion (as indicated in the scatterplots of B2 vs. B3 and A2 vs. A3). This results in the between-species scatterplots beginning to resemble an X-alignment. (B) X-like alignment in dotplot of the main chromosomes of Vibrio cholerae (x-axis) and Vibrio parahaemolyticus (y-axis). (C) A weak X-like pattern exists even when comparing more distantly related species, in this case V. cholerae and E. coli. An X-like pattern indicates that the distance of a gene from the origin is conserved, but the side of the origin on which it is located is not conserved.
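The symmetric inversion model in the caption can be checked with a small simulation. The sketch below is an illustration under simplifying assumptions, not the analysis from Eisen et al., 2000: the chromosome is a list of 32 hypothetical genes, slots are numbered clockwise starting just after the origin, the terminus sits exactly opposite the origin, strand orientation is ignored, and the function name symmetric_inversion and the inversion sizes are invented for the example. Each lineage gets its own series of inversions centered on the origin or the terminus, and the resulting dot plot points are then tested against the two diagonals of an X.

```python
def symmetric_inversion(genome, half_width, around="origin"):
    """Invert a segment centered on the origin or the terminus.

    Slots run 0..n-1, with the origin between slots n-1 and 0 and the
    terminus between slots n/2-1 and n/2. Reversing a segment centered on
    either point swaps slot i with slot n-1-i.
    Assumes n is even and half_width <= n // 2.
    """
    n = len(genome)
    if around == "origin":
        left_slots = range(half_width)
    elif around == "terminus":
        left_slots = range(n // 2 - half_width, n // 2)
    else:
        raise ValueError("around must be 'origin' or 'terminus'")
    result = list(genome)
    for i in left_slots:
        j = n - 1 - i
        result[i], result[j] = genome[j], genome[i]
    return result


def slot_of(genome):
    """Map each gene to the slot it occupies."""
    return {gene: slot for slot, gene in enumerate(genome)}


ancestor = list(range(1, 33))  # hypothetical genes 1 to 32
n = len(ancestor)

# Lineage A and lineage B each undergo their own symmetric inversions.
a = symmetric_inversion(ancestor, 10, "terminus")
a = symmetric_inversion(a, 4, "origin")
b = symmetric_inversion(ancestor, 6, "terminus")
b = symmetric_inversion(b, 12, "origin")

pos_a, pos_b = slot_of(a), slot_of(b)
points = [(pos_a[g], pos_b[g]) for g in ancestor]

on_main = sum(1 for x, y in points if y == x)          # same side of the origin
on_anti = sum(1 for x, y in points if y == n - 1 - x)  # opposite side
print(on_main, on_anti)
```

Every point falls on one of the two diagonals because a symmetric inversion can move a gene to the other side of the origin but cannot change its distance from the origin, which is the interpretation of the X-like pattern given in the caption.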
Gene Loss
For example, B. aphidicola APS has undergone a massive reduction in its genome since it shared a common ancestor with E. coli (Fig. 7.5). This symbiont lives inside aphid cells where many genes required for the free-living lifestyle of E. coli are not needed.
Gene Content Is in Constant Flux in Bacteria and Archaea
The availability of hundreds of complete genome sequences enables scientists to examine how gene content evolves. The first analysis of this sort was performed using the first two sequenced genomes: M. genitalium and H. influenzae. Despite the fact that both species have very small genomes, hundreds of homologous genes were identified. These shared genes were proposed to be the “minimal gene set” of a bacterium; that is, they might represent the genes that are essential for making a bacterium (Fig. 7.6A).
However, as more genomes from different phylogenetic groups have been sequenced, the number of “core” homologous genes has diminished (Fig. 7.6B). The reason for this became clear when genomes from different strains of the same species were compared. This was first done with the pathogenic strain E. coli (O157:H7) and the E. coli K12 laboratory strain. Although these strains share approximately 4000 highly conserved genes, O157:H7 has more than 1000 genes not found in K12, and K12 has approximately 500 genes absent from O157:H7 (Fig. 7.7). Similar patterns of
[Figure 7.5: gene maps of the same genome fragment in the ancestor (top row) and in Buchnera (bottom row); scale bar 10 kb. Gene labels: adk, htpG, recR, ybaB, dnaX, apt, priC, ybaM, aefA, acrR, acrA, acrB, RNA-ffs, amtB, ybaE, ybaX, ybaW, ybaV, ybaZ, ybaY, tesB, ginK, mdlB, mdlA, ybaO, ybaU, hupB, clpX, clpP, tig, bolA, cof, ybaN.]
FIGURE 7.5. Genome reduction in Buchnera endosymbionts of aphids. A fragment of two genomes is shown. (Top row) The putative ancestor of all aphid endosymbionts in the Buchnera genus. (Bottom row) The genome of the symbionts today. The massive amounts of gene loss are indicated by the genes colored white in the ancestral genome that are missing from the modern genome below. Orthologous genes between the two genomes are shown in the same color. Note the conservation of gene order between the two genomes despite the gene loss. The direction of gene transcription is indicated by the gene box being shifted above or below the black line.
[Figure 7.6: Venn diagrams. (A) Mycoplasma genitalium (468 genes) and Haemophilus influenzae (1703 genes), with 240 shared genes. (B) Many overlapping genomes, with 80 shared genes.]
FIGURE 7.6. (A) Comparison of predicted protein-coding genes in the first two completed genomes, Haemophilus influenzae and Mycoplasma genitalium. Approximately 240 genes are shared between the two species. (B) Comparison of the predicted protein-coding genes of the first 25 bacterial genomes (not all 25 circles are shown). Note that only about 80 genes can be identified as being shared among all of these species.
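The shrinking set of shared genes in Figure 7.6 is a running set intersection: each added genome can only remove genes from the core, never add them. A minimal sketch follows, assuming each genome has already been reduced to a set of gene-family labels. The three toy genomes and the function name core_size_curve are hypothetical and carry no real data.

```python
def core_size_curve(genomes):
    """Return the core gene set and its size after each genome is added.

    `genomes` is a list of sets of gene-family labels; the core is the
    intersection of all genomes seen so far.
    """
    core = None
    sizes = []
    for genes in genomes:
        core = set(genes) if core is None else core & set(genes)
        sizes.append(len(core))
    return core, sizes


# Toy gene-family content for three hypothetical genomes.
genomes = [
    {"dnaA", "rpoB", "tufA", "trpE"},
    {"dnaA", "rpoB", "tufA", "lacZ"},
    {"dnaA", "rpoB", "nifH"},
]

core, sizes = core_size_curve(genomes)
print(sorted(core), sizes)
```

The size list can never increase, which matches the drop from about 240 shared genes for two genomes to about 80 for 25 genomes.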
Gene Duplication
Why Duplications Are Useful to Identify
• Allows division into orthologs and paralogs
• Improves functional predictions
• Helps identify mechanisms of duplication
• Can be used to study mutation processes in different parts of a genome
• Lineage-specific duplications may be indicative of species-specific adaptations
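One simple way to flag candidate lineage-specific duplications is to ask whether a gene's best non-self match lies in its own genome rather than in another species. The sketch below illustrates that heuristic only; it is an assumption made for this example and not necessarily the method behind the slides. The gene names, scores, and the function name lineage_specific_paralogs are all hypothetical.

```python
def lineage_specific_paralogs(hits, genome_of):
    """Return genes whose best non-self hit is in the same genome.

    `hits` maps (query, subject) -> similarity score (higher is better);
    `genome_of` maps each gene to the genome it belongs to.
    """
    best = {}
    for (query, subject), score in hits.items():
        if query == subject:
            continue  # ignore self-matches
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return sorted(
        query
        for query, (subject, _) in best.items()
        if genome_of[query] == genome_of[subject]
    )


# Toy input: three genes from genome "Vc" and one from genome "Ec".
genome_of = {"vc1": "Vc", "vc2": "Vc", "vc3": "Vc", "ec1": "Ec"}
hits = {
    ("vc1", "vc2"): 90, ("vc1", "ec1"): 60,
    ("vc2", "vc1"): 90, ("vc2", "ec1"): 55,
    ("vc3", "ec1"): 80, ("vc3", "vc1"): 40,
    ("ec1", "vc3"): 80,
}

candidates = lineage_specific_paralogs(hits, genome_of)
print(candidates)
```

In this toy case vc1 and vc2 are each other's closest match, which suggests a duplication after the two lineages split, while vc3 and ec1 pair across genomes as expected for orthologs. A phylogenetic tree of the whole family is the more reliable test.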
C. pneumoniae - All Paralogs
[Dot plot. x-axis: query ORF position (0 to 1,250,000). y-axis: subject ORF position (0 to 1,250,000).]
C. pneumoniae Lineage-Specific Paralogs
[Dot plot. x-axis: query ORF position (0 to 1,250,000). y-axis: subject ORF position (0 to 1,250,000).]
Expansion of MCP Family in V. cholerae
[Neighbor-joining (NJ) tree of MCP family proteins. Leaves are labeled by species and gene identifier, with sequences from E. coli, B. subtilis, Synechocystis sp., H. pylori, H. pylori 99, C. jejuni, A. fulgidus, T. maritima, T. pallidum, B. burgdorferi, P. abyssi, P. horikoshii, D. radiodurans, M. tuberculosis, and V. cholerae. The V. cholerae sequences (VC and VCA identifiers) make up the largest group. Some branches are marked with asterisks.]
Heidelberg et al. (2000)
After the Genomes
• Better analysis and annotation
• Comparative genomics
• Functional genomics (Experimental analysis of gene function on a genome scale)
• Genome-wide gene expression studies
• Proteomics
• Genome-wide genetic experiments