the slow road to the eukaryotic genome



The slow road to the eukaryotic genome

Leo Lester, Andrew Meade, and Mark Pagel*

Summary

The eukaryotic genome is a mosaic of eubacterial and archaeal genes in addition to those unique to itself. The mosaic may have arisen as the result of two prokaryotes merging their genomes, or from genes acquired from an endosymbiont of eubacterial origin. A third possibility is that the eukaryotic genome arose from successive events of lateral gene transfer over long periods of time. This theory does not exclude the endosymbiont, but questions whether it is necessary to explain the peculiar set of eukaryotic genes. We use phylogenetic studies and reconstructions of ancestral first appearances of genes on the prokaryotic phylogeny to assess evidence for the lateral gene transfer scenario. We find that phylogenies advanced to support fusion can also arise from a succession of lateral gene transfer events. Our reconstructions of ancestral first appearances of genes reveal that the various genes that make up the eukaryotic mosaic arose at different times and in diverse lineages on the prokaryotic tree, and were not available in a single lineage. Successive events of lateral gene transfer can explain the unusual mosaic structure of the eukaryotic genome, with its content linked to the immediate adaptive value of the genes it acquired. Progress in understanding eukaryotes may come from identifying ancestral features such as the eukaryotic spliceosome that could explain why this lineage invaded, or created, the eukaryotic niche. BioEssays 28:57–64, 2006. © 2005 Wiley Periodicals, Inc.

Introduction

The phylogenetic placement of the eukaryotes among the

prokaryotes has been called ‘‘evolution’s sorest spot’’.(1)

Almost thirty years ago, Woese suggested a classification

system that divided life into three domains:(2) the Eukaryota,

the Eubacteria and the Archaea. Phylogenetic studies using

ribosomal markers and other essential genes gave support to

Woese’s view by showing that the domains were monophyletic,(3–5)

and primordially duplicated genes placed the root

of all life within the eubacteria, leaving the archaea and

eukaryotes as sister taxa.(6,7)

Recent genomic studies have begun to complicate this

picture. As further genes have been sequenced, so the position

of the eukaryotes has been found to jump. Ribosomal RNA might

place the eukaryotes alongside the archaea, but other genes

make them sister to the eubacteria.(8) More generally, so-called

informational genes, those involved in essential housekeeping

duties, return topologies in which the eukaryotes and archaea

are sisters, while metabolic genes give trees which place

the eukaryotes closer to the eubacteria.(8,9) These alternative

phylogenies stem from the unusual nature of the eukaryotic

genome: it turns out to be a mosaic of the prokaryotic

domains.(10–12)

How this mosaic formed is fundamental to theories of the

origin and early evolution of the lineage that eventually gave

rise to the eukaryotes. One idea proposes that the mosaic

genome arose from an ancient fusion event between an

archaeon and a eubacterium,(10) possibly deriving from a

symbiotic relationship between the two. Whether or not this

early fusion ever occurred, a second idea links the mosaic

to the endosymbiotic origin of the mitochondria.(13,14) Theory

proposes that the endosymbiont was eubacterial(15)

and that, over time, many of its genes transferred to the

eukaryotic nucleus.(16) Whether genes were acquired from an

endosymbiont or some other source, it has been suggested

that the majority coded for metabolic capabilities,(17,18)

although more recent work shows that informational genes

of apparently eubacterial origin can also be found in the

eukaryotic nucleus.(19)

Either of the fusion hypotheses can explain the broad

features of the eukaryotic mosaic—its complements of

eubacterial and archaeal genes—but equally neither excludes

a third possibility; this is that myriad lateral gene transfer

events among the prokaryotes over long periods of time slowly

built up a unique lineage, one that we now recognise as

eukaryotic.(20) This slow-drip scenario does not doubt the

existence of the endosymbiont, but questions whether it is

necessary to invoke it to explain the nuclear mosaic.(17,20)

As part of an already functioning organism, many of the

endosymbiont’s functions may have been redundant or

unnecessary,(22) and it may simply have lost many of its genes.

School of Animal and Microbial Sciences, The University of Reading

University, UK.

Funding agency: Biotechnology and Biological Sciences Research

Council (UK); Grant numbers: 45/G14980 and 45/G19848 awarded to

Mark Pagel.

*Correspondence to: Mark Pagel, The University of Reading

University, School of Animal and Microbial Sciences, Whiteknights, PO

Box 228, Reading RG6 6AJ, UK. E-mail: [email protected]

DOI 10.1002/bies.20344

Published online in Wiley InterScience (www.interscience.wiley.com).

BioEssays 28:57–64, © 2005 Wiley Periodicals, Inc. BioEssays 28.1 57

Problems and paradigms

Our interest here is to assess the evidence for the slow-drip

hypothesis, using phylogenetic methods applied to data on the

presence and absence of genes in both prokaryotes and

eukaryotes. We first show how successive events of lateral

gene transfer can produce both a mosaic genome and

phylogenetic patterns indistinguishable from those that fusion

arguments predict. We then use statistical models to

reconstruct probable first appearances of the metabolic and

informational genes that make up the eukaryotic mosaic. We

find that the number of genes that can be explained by fusion or

slow-drip theories (as opposed to being present ancestrally)

is small when compared to a typical prokaryotic genome. We

show that these genes first appeared in the prokaryotes at very

different times and in diverse lineages: they were not all

available in a single species for a fusion or endosymbiosis

event to transfer them to the eukaryote. We conclude with a

discussion of how the slow-drip view fits with our knowledge of

the origin of the peculiar features associated with eukaryotes

and with features of the mitochondrial proteome.

Fusion and a ring of life

Fusion provides an explanation for how the eukaryotic

genome acquired its unusual mosaic collection of genes. This

supposition, though, never received specific support until a

recent novel phylogenetic argument purporting to find not a

tree of life, but a ring of life.(23) Rivera and Lake(23) show that

when two genomes fuse to produce a new third species, a

peculiar distribution of phylogenetic trees is expected to arise

from resamplings of the original data. This set of trees has the

property that, in some, the fusion species clusters closest to

one of the putative donors and, in other trees in the set, it

clusters with the other donor species. Significantly, intermediate

trees, in which the fusion species clusters with other

species that fall phylogenetically somewhere between the two

donors, are not expected. This means that the trees in the set

can be written as permutations of a cycle graph in which the

order of species is preserved.

Fig. 1 reproduces Lake and Rivera’s argument. For three

species (X, Y, Z), there are eight possible combinations of

the presence and absence of a gene. All these combinations

are possible but some are more likely than others depending

upon the phylogenetic relationships among the three species.

Now consider that species X and Y fuse their genomes to

produce a new species W. Whenever X or Y has a gene, W

receives it. Species Z is not part of the fusion. Of the eight

original combinations, only two are phylogenetically informative

about the relative position of W with respect to X or Y. One

of these favours a tree that places W next to X, the other a tree

in which W is placed next to Y. If the trees are aligned around

species W, they form a repeating pattern that can be described

by the ring shown there. A graphical interpretation of this result

is that of a ring of life: W has received genes from both sides of

a phylogenetic ring that joins X, W and Y.
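
The counting argument above can be checked by brute force. The following is a minimal sketch, not the authors' code: it enumerates the eight presence/absence patterns for X, Y and Z, derives W as present whenever either donor holds the gene, and keeps only patterns that split the four species two against two. Treating a two-against-two split as the meaning of "phylogenetically informative" is an assumption made here for illustration; the function name is hypothetical.

```python
from itertools import product

def fusion_patterns():
    """Return the 2-versus-2 splits produced when W is a fusion of X and Y."""
    informative = []
    for x, y, z in product((0, 1), repeat=3):
        w = x | y  # W holds a gene whenever either donor holds it
        pattern = {"X": x, "Y": y, "Z": z, "W": w}
        present = sorted(s for s, v in pattern.items() if v == 1)
        absent = sorted(s for s, v in pattern.items() if v == 0)
        # Assumption: with four taxa, only a 2-versus-2 split favours
        # one grouping of species over another
        if len(present) == 2 and len(absent) == 2:
            informative.append((tuple(present), tuple(absent)))
    return informative

for present, absent in fusion_patterns():
    print(present, "|", absent)
```

Under this assumption exactly two patterns survive: one groups W with Y (leaving X with Z) and the other groups W with X, matching the two resolved topologies described in the text.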

Rivera and Lake(23) applied this logic to the analysis of gene

presence/absence data from three eubacterial, three archaeal

and two eukaryotic species, using their method of conditioned

reconstruction.(24) With genomes of unequal length, it is

impossible to say with certainty how similar or different two

species are. The reason is that, of the four categories of gene

presence/absence in two species, the proportion of genes in

the ‘both-genes-absent’ category is arbitrary, depending upon

the length of the longer genome. If one species were to have a

genome of length 100 and two others genomes of length 75,

there would be up to 25 genes present in the species

with the longest genome that may not be present in either of

the genomes of the other two. These other two will look more

similar than perhaps they really are by virtue of being identical

on these arbitrary shared-absence sites. Lake and Rivera’s(24)

‘conditioned reconstruction’ algorithm chooses a genome

against which to condition all of the other genomes, thereby

giving a definable upper limit in any given data set to the

proportion of absent/absent sites.
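
The dependence of shared absences on the gene set being scored can be illustrated with toy data. This is a hedged sketch only: the genomes are hypothetical sets of integer gene identifiers, and conditioning is modelled here simply as restricting the scored genes to those present in one chosen genome, which is an assumption about the method rather than a reproduction of the published algorithm.

```python
def shared_absences(genome_a, genome_b, universe):
    """Count genes in `universe` that are absent from both genomes."""
    return sum(1 for g in universe if g not in genome_a and g not in genome_b)

# Hypothetical toy genomes: one of length 100 and two of length 75
long_genome = set(range(100))
short_1 = set(range(0, 75))
short_2 = set(range(5, 80))

# Scored against an arbitrary, larger gene list the count is inflated
arbitrary_universe = set(range(200))
unconditioned = shared_absences(short_1, short_2, arbitrary_universe)

# Assumption: conditioning restricts scoring to the genes of one genome,
# so the count can never exceed the size of that genome
conditioned = shared_absences(short_1, short_2, long_genome)

print(unconditioned, conditioned, len(long_genome))
```

The two short genomes are the same in both calls; only the list of genes scored changes, which is why an unconditioned absent/absent count says little about how similar two species really are.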

In two separate phylogenetic analyses using their conditioned

data, Rivera and Lake found sets of alternative

topologies as predicted from the fusion argument. In one, five

alternative topologies accounted for 96.3% of the trees

observed and, in the other, five alternative trees accounted

for 87.1% of the results. Higher proportions could be obtained

by collapsing some nodes. All the alternative topologies

Figure 1. Drawn after Lake and Rivera (Fig. 3).(24) The table

shows the eight possible combinations of gene presence and

absence for the three species X, Y and Z. The eight

combinations arise with varying probabilities, labelled a to h,

which must sum to 1.0. W is a fusion of X and Y such that a gene

is present in W if it is present in either donor genome. Only two

patterns in the table are phylogenetically informative about the

relative position of W with respect to X or Y; these are denoted

i and ii. The phylogenies inferred from these two sites are

represented in the linear diagrams labelled i and ii respectively.

These linear diagrams do not show relationships in the way of a

conventional unrooted tree, but in terms of which species

comes out next to which. In i, W comes out next to Y, and Z next

to X; in ii, W is next to X. The remaining patterns yield

unresolved trees. The two resolved topologies that emerge

from a fusion can be visualised as a ring (diagram iii).


could be aligned as in Fig. 1 to represent permutations of a

cycle graph, with some placing the eukaryotes closer to the

archaea, others closer to the eubacteria, and some placing

them in between the two domains. Rivera and Lake(23) conclude

that the three domains of life are connected not as a

phylogeny but as a ring of life in which fusion has caused genes

to flow into the eukaryotes from both prokaryote domains.

Lateral gene transfer and a ring of life

The strength of evidence that a ring-like tree structure provides

for fusion depends upon whether other processes could also

produce the predicted set of trees. Consider, in Fig. 1, that W’s

genome might have been shaped, not by a single fusion event,

but by successive events of lateral gene transfer from species

additional to those shown. Let a donor other than X contribute

one of the phylogenetically informative genes, or a donor other

than Y contribute the other. The trees characteristic of the ring

structure will still describe these species’ data even though no

fusion occurred. Phylogenetic similarity among closely related

species means that many will have similar complements of

genes, making them possible donors of the phylogenetically

informative genes.

A simple computer simulation shows that this argument can

be applied more generally. We simulated gene presence/

absence data for 2000 genes on a tree of eight taxa (Fig. 2A),

imagining them to consist of four eubacterial (B1–4) and four

archaeal (A1–4) species. Each of our species, therefore, had a

string of 2000 presence/absence codes, corresponding to

whether or not the species carried the gene at that locus. By

simulating the data, we avoid the problems of genomes of

unequal lengths for which the conditioned reconstruction

method was proposed. We inferred phylogenies using

standard maximum likelihood methods for binary data,

although simpler parsimony methods give the same results.

The simulated data returned the original tree.

We chose two of the simulated genomes, one from the

eubacterial domain (B4), the other an archaeon (A1), to act as

donors to form a hybrid ‘eukaryotic fusion’ genome (E)

comprising a complete hybrid of the two donors: if either or

both of the donors held a gene, then the hybrid did as well, just

as in the logic of Fig. 1. Next, we simulated evolution by

allowing the original species’ genomes to gain or lose genes

over a number of generations. A gene could only be gained by

a species if it was present in at least one other species; that is,

genes were not created de novo. Gains can, therefore, be

treated as events of lateral gene transfer. For the original eight

genomes, stabilizing selection was imposed so that mutations

(genes gained or lost) that moved the evolving genome away

from the simulated starting point were fixed at a lower

probability than mutations that moved it back. The genome

of the fusion species (E) was treated differently. At the

beginning of the generational simulations, we assigned it the

genome of one of its original donors (A1). We then imposed

directional selection on this genome such that newly acquired

genes that moved it towards the hybrid fusion genome were

fixed at a higher probability than were others. The ratios of

fixation probabilities for positive and negative mutations were

the same for all species.

We ran the generational simulation, allowing species to

gain or lose genes each generation, until an equilibrium

similarity was reached between the evolving genomes and

either their fixed starting points, or, in the case of the hybrid

fusion species, the fixed endpoint. This gave us evolved data,

influenced by lateral gene transfer, for each of the species.

Removing the hybrid fusion species E, the evolved data

reproduced the original tree. Analysing the simulated data

with the hybrid ‘eukaryote’ included produced six alternative

Figure 2. A: The tree used in the simulations of lateral gene

transfer. We imagine four eubacterial (B1–4) and four archaeal

(A1–4) species, giving a total of eight taxa. These correspond to

the labels in Fig. 1 as follows: B4 = X, and A1 = Y. The remaining

six species can be seen as various Zs: background species not

involved in the ‘fusion’. B: The trees arising from the simulation

of lateral gene transfer, and drawn to show the linear

permutations (after Rivera and Lake(23)). In addition to the

eight taxa in the tree in part A, there is now the single hybrid

eukaryotic species (here labelled E, but corresponding to Fig.

1’s W), giving a total of nine taxa. If our simulated data produce

a ring then we expect all the topologies to conform to

permutations of the basic B-E-A order. These permutations

are shown to the left of each topology and the cumulative

percentages of the topologies in the bootstrap sample to the

right. All six topologies are linear permutations of an underlying

cycle graph, despite some phylogenetic uncertainty within

‘domains’. Lateral gene transfer can produce cycle-graphs.


topologies that cumulatively account for 97% of the observed

trees (Fig. 2B). By aligning these six topologies around the

‘eukaryote’ species, it can be seen that they are permutations

of an apparently underlying cycle graph linking the eubacterial

(B), eukaryotic (E) and archaeal (A) species. All the topologies,

despite some phylogenetic uncertainty within ‘domains’,

conform to permutations of the basic B-E-A order, as shown

to the left of each topology. In some trees, the ‘eukaryotes’ are

ambiguously between the two ‘prokaryotic’ domains; in others,

they are clearly with either the ‘eubacteria’ (2.7%) or the

‘archaea’ (17.6%).

The simulations show that sets of trees conforming to

permutations of a cycle graph, as in Fig. 1, can arise solely

from a succession of lateral gene transfer events, and do not

require fusion. An objection might be that all that is shown by

these simulations is that we can impose directional selection

on a genome. But the deeper phylogenetic issue is that in any

data set in which there has been lateral gene transfer, species

will show affinities to more than one other species at the gene

level: lateral gene transfer produces conflicting phylogenetic

signals. If the conflicting signals in real data are divided among

other species, the sort of phylogenetic cycle graph that we

have produced can be observed in bootstrap samples.

Phylogenetic similarity among closely related species ensures

that many can act as donors and still give an apparently ring-

like result: stochastic variation means that there will always be

two that contribute more than others. No special mechanism is

required to explain species or lineages moving about in

phylogenetic trees. It is a consequence of conflicting phylogenetic

signal, which has many causes.

Ancestral states

Even if lateral gene transfer can produce cycle-graphs of trees,

the eukaryote genome is unusual in having so many of its

genes seemingly derived from the eubacterial domain;(19) this

is the feature often taken as evidence either of a fusion or an

endosymbiotic origin of the eukaryotic genome. How else

could the presence of so many genes be explained? Yet fusion

or endosymbiont theories must also explain why the many

genes available from a fusion or endosymbiotic event would be

retained, unless they had some immediate adaptive value.

Successive events of lateral gene transfer provide a plausible,

if pedestrian, mechanism: the many genes in eukaryotes of

apparently eubacterial origin are there because they have

been acquired over time for their adaptive value at the time of

acquisition, and not for some possible future function.

We cannot now assess the possible advantages these

genes conferred, but we can examine two issues that reflect on

the plausibility of the lateral gene transfer explanation. One of

these is to determine how many genes need explaining. The

eukaryotic genome is large and, although lateral gene transfer

is common,(25–27) we wish to know whether it is common

enough to explain the eukaryotic mosaic. The second issue is

related to the first. Whereas fusion theories for the mosaic

identify a single source, lateral gene transfer allows a multitude

of sources. We can use ancestral reconstructions to identify

the probable first appearances, on the prokaryotic tree, of

genes that are now found in the eukaryotic mosaic. These will

show whether they tend to be confined to a single lineage and

were all available at one time, or whether their appearances

are distributed throughout the prokaryotic tree.

The NCBI Clusters of Orthologous Groups, or COGs,

database(28) and the more recent dataset from Esser et al(19)

can be used to investigate both these questions. The COGs

database records the presence or absence in a large number

of prokaryotic species plus several eukaryotes of 2597 sets of

orthologous proteins with informational or metabolic functions.

Esser et al. searched all the genes in the yeast (Saccharomyces

cerevisiae) genome, identifying 850 that had possible

homologues among the prokaryotes. The two numbers differ

because the COGs data set does not require that a COG

includes a eukaryote. We used both data sets to infer the

probable ancestral states of genes (present or absent) at each

of the nodes of a phylogeny of the prokaryotes and eukaryotes

(Fig. 3). Ancestral states were inferred from maximum

likelihood statistical methods, allowing for unequal rates of

gains and losses on the tree, and fitting a separate model to

each gene.(29)
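
A model with unequal gain and loss rates can be written as a two-state Markov process on each branch. The sketch below is a simplified illustration and not the BayesMultiState implementation: it computes the branch transition probabilities for given rates and then the probability that a gene was present at an ancestor whose descendants are all tips (a star tree with a flat prior on the ancestor). The rates are fixed, hypothetical inputs here, whereas a maximum likelihood analysis would estimate them separately for each gene.

```python
import math

def transition_probs(gain, loss, t):
    """P[i][j] = probability of state j after time t given state i.

    States: 0 = gene absent, 1 = gene present.
    """
    total = gain + loss
    decay = math.exp(-total * t)
    pi1 = gain / total  # long-run probability of presence
    pi0 = loss / total  # long-run probability of absence
    return [
        [pi0 + pi1 * decay, pi1 - pi1 * decay],
        [pi0 - pi0 * decay, pi1 + pi0 * decay],
    ]

def ancestor_posterior(tip_states, branch_lengths, gain, loss):
    """Probability that the gene is present at the ancestor of the tips."""
    like = [1.0, 1.0]
    for state, t in zip(tip_states, branch_lengths):
        p = transition_probs(gain, loss, t)
        for anc in (0, 1):
            like[anc] *= p[anc][state]
    # Flat prior on the ancestral state (an assumption of this sketch)
    return like[1] / (like[0] + like[1])

# Hypothetical example: gene present in two of three descendants,
# with losses assumed four times as frequent as gains
print(ancestor_posterior([1, 1, 0], [0.1, 0.1, 0.1], gain=0.5, loss=2.0))
```

A node would then be scored as carrying the gene only when this probability passes whatever confidence threshold the analysis adopts.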

We first sought to identify the subset of genes in the

eukaryotes that are candidates for possible explanation either

by fusion or lateral gene transfer from the eubacteria. These

are genes that are found in both the eubacteria and in at least

one of the eukaryotes, but are inferred not to be ancestral

either to the archaea or to the common ancestor of the archaea

and eukaryotes. Our criteria ensured that genes inferred as not

ancestral were generally absent in all, or nearly all, extant

archaea. This may mean that we overestimate the size of the

candidate set for horizontal transfer. Nevertheless, it is

possible that some of the genes that we identify as absent

may have been ancestrally present but were later lost in all

archaea.

Our procedure identifies 1100 orthologues from the COGs

database and 665 genes from the Esser et al. data. The

discrepancy in numbers probably arises because the COGs

data are constructed using more lenient sequence-similarity

rules than Esser et al. We also had two eukaryotes

(S. cerevisiae and Schizosaccharomyces pombe) in the

COGs data to match against the prokaryotes, compared to

just S. cerevisiae in the Esser et al. data set. Metabolic genes

predominate in both subsets (65% of the COGs and 60% of

the Esser sample), but large numbers of informational genes

have also apparently been gained from the eubacteria.

This accords with Esser et al.’s finding that eukaryotes, as

represented by yeast, have more genes of eubacterial

ancestry than archaeal ancestry.(19)


Our first question concerned whether the number of genes

in the candidate sets calls for special explanations. In fact,

these numbers may suggest that lateral gene transfer has

played a smaller role in eubacterial evolution than is sometimes

assumed. Given the timescales involved in the evolution

of eukaryotes, rates of lateral gene transfer would have to rise

little higher than 1 × 10⁻⁶ genes/year to account for 1100

genes. Taking a typical gene to be around 1 kb in length, this is

within the rates of 16 kb laterally transferred per million years

inferred for Escherichia coli.(30)
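
The arithmetic behind this comparison is short enough to check directly. The sketch below assumes a timescale of 1.5 billion years for the eukaryote lineage, the figure the article cites later for its origin; the other values are taken from the paragraph above. It is a back-of-envelope check, not a rate estimate.

```python
candidate_genes = 1100        # COGs candidate set reported in the text
lineage_age_years = 1.5e9     # assumed timescale for the eukaryote lineage
gene_length_kb = 1.0          # "a typical gene" of about 1 kb
ecoli_rate_kb_per_myr = 16.0  # rate inferred for E. coli in the text

# Average number of genes that must be acquired per year
genes_per_year = candidate_genes / lineage_age_years
# The same rate expressed as kilobases per million years
kb_per_myr = genes_per_year * gene_length_kb * 1e6

print(f"{genes_per_year:.1e} genes/year")
print(f"{kb_per_myr:.2f} kb per million years")
print(kb_per_myr < ecoli_rate_kb_per_myr)
```

On these assumptions the required rate is a little under 1 × 10⁻⁶ genes per year, or less than 1 kb per million years, well inside the 16 kb per million years quoted for E. coli.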

To examine our second question, that of the phylogenetic

distribution of the candidate genes, we inferred the point of first

appearance of each gene on the tree in Fig. 3. In reconstructing

first appearances on the tree, we adopt a liberal criterion,

requiring only a 70% confidence in the inference, well below

the conventional 95% criterion.(31) The numbers shown at

each node record the number of genes reconstructed to have

first appeared at that node: the first number corresponds to the

reconstructions for the COGs database and the second to

those from Esser et al. At the base of the tree, the COGs data

suggest that a greater proportion of the genes were ancestral.

This almost certainly reflects the more lenient definition of a

homologue in that data set. By using a less-strict criterion for a

match between two genes, the COGs genes tend to be more

widely phylogenetically distributed, resulting in more being

reconstructed as present at the base.

Despite these initial differences, the overall pattern in Fig. 3

is one of the gradual accumulation of genes over long periods

of time and in phylogenetically diverse lineages. The various

metabolic and informational functions that eukaryotes acquired

were not all invented in one lineage but arose, perhaps

in response to varying environmental demands, in a variety of

lineages. No single lineage carries a large proportion of the

candidate genes at any one time, and only a relatively small

number of genes is reconstructed to have been present early

in eubacterial evolution. This accords with analyses of whole

genomes showing that eukaryotes share genes with a wide

range of prokaryotes, with no single prokaryotic species

dominating.(19,32,33) In a few branches of the tree, comparatively

large numbers of genes do arise. These tend to

correspond to a recent radiation of clades or groups of

species, suggesting that diversification into new niches

required new kinds of genetic functions.

The eubacteria are now highly genetically diverse, but logic

and the data give no reason to believe that they started out with

complex genomes and then diversified by a process of

sculpting away unnecessary or irrelevant tranches of genes,

even if specific species have undergone reductive evolution. In

combination with Fig. 3, this leaves fusion theories with an

awkward choice. For the fusion partner to have had enough

genes to explain the contemporary data, the event would have

had to have taken place near the tips of the prokaryotic tree.

But the tips of the tree extend back a few hundred million years,

Figure 3. Reconstructed first appearances of genes identified

as candidates for horizontal transfer to the eukaryotes (see text).

The tree is drawn to capture the main features of prokaryote

phylogeny supported in part and whole by several recently

published papers.(58–62) Different trees will alter specific details

but not the broad patterns. Gene presence/absence data were

taken both from the NCBI COGs database(28) and from Esser

et al.(19) The ancestral reconstruction programme used was

BayesMultiState.(29,31) Genes were identified as candidates for

horizontal transmission to the eukaryotes from the eubacteria if

present in the eubacteria and in at least one of the eukaryotes,

but which are inferred not to be ancestral to the archaea or to

the common ancestor of the archaea and eukaryotes. This

procedure identifies 1100 orthologues from the COGs database

and 665 genes from the Esser et al. data. In reconstructing first

appearances on the tree, we adopt a liberal criterion, requiring

only a 70% confidence in the inference, well below the conventional

95% criterion.(31) The first set of numbers above each

node relate to the number of genes reconstructed from the

COGs database to have first appeared at that node. The second

set of numbers represents the same subset but this time drawn

from the Esser et al. dataset. Genes tend to appear gradually

throughout the tree and in diverse lineages: no one lineage has a

large proportion of the total genes. Our liberal reconstruction

criterion will tend to reconstruct genes earlier in prokaryote

evolution than a stricter criterion would. The tree does not

reconstruct the loss of genes, only the point at which new genes

are gained. For this reason, the numbers at sequential nodes

shouldnot beadded together to find the total numberofgenes for

a particular species. Nodes given the number zero signify

lineages in which there has been no net gain of new genes.

Problems and paradigms

BioEssays 28.1 61

not the 1.5 billion needed for the origin of the eukaryote

lineage.(34) If an earlier node is identified as the fusion partner,

then so much subsequent lateral gene transfer must be in-

voked to complete the eukaryotic set of eubacterial genes that

the fusion event ceases to be a revolutionary point of origin.
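
The role of the confidence threshold described in the Fig. 3 caption can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each node on a root-to-tip path carries a posterior probability that the gene is present, and that a gene's first appearance is the earliest node meeting the threshold. The node names, the probabilities and the function `first_appearance` are hypothetical; they are not taken from BayesMultiState or from the data.

```python
from typing import List, Optional, Tuple

# Hypothetical root-to-tip path: (node name, assumed probability the gene is present).
Path = List[Tuple[str, float]]


def first_appearance(path: Path, threshold: float = 0.70) -> Optional[str]:
    """Return the earliest node (nearest the root) whose presence
    probability meets the threshold, or None if no node does."""
    for node, prob in path:
        if prob >= threshold:
            return node
    return None


# Invented values, ordered from the root towards a tip.
path = [("root", 0.10), ("n1", 0.40), ("n2", 0.75), ("n3", 0.97), ("tip", 1.00)]

liberal = first_appearance(path, threshold=0.70)  # assumed 70% criterion
strict = first_appearance(path, threshold=0.95)   # conventional 95% criterion
print(liberal, strict)
```

Under these assumed values the 70% criterion places the gene one node nearer the root than the 95% criterion would, which is the direction of bias noted in the caption.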

A potentially confounding influence in Fig. 3 is that the

genes reconstructed as first appearing near the tips may be so

highly divergent as not to be recognised as present in other

eubacterial species. If this were true, their first appearances

should be reconstructed earlier in the tree. This seems unlikely

because each of the genes that we reconstruct in Fig. 3 has

been identified as being present in yeast as well as in at least

one eubacterium. The Esser et al. data set, by providing

measures of sequence similarity between each yeast gene

and its eubacterial homologue, allows a test of this possibility.

Fig. 4 plots the average similarity scores for genes

reconstructed to make their first appearance at varying phylogenetic

distances from the root of the tree. If the rate of evolution does

confound the result, we would expect that the average

similarity should be lower for genes reconstructed higher up

the tree. The figure shows that, in contrast to this expectation,

the range of similarity scores for genes reconstructed the

furthest from the root is comparable to that for genes

reconstructed as appearing at the root of the tree.
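
The comparison behind Fig. 4 amounts to summarising similarity scores by distance from the root. A hedged sketch follows, assuming each gene is represented as a (path length, similarity) pair. The toy values and the helper `summarise_by_distance` are invented for illustration and do not reproduce the Esser et al. scores.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def summarise_by_distance(
    records: Iterable[Tuple[float, float]]
) -> Dict[float, Tuple[float, float, float]]:
    """Group (path_length, similarity) pairs by path length and
    return {path_length: (min, mean, max)} of the similarity scores."""
    groups = defaultdict(list)
    for path_length, similarity in records:
        groups[path_length].append(similarity)
    return {d: (min(v), mean(v), max(v)) for d, v in sorted(groups.items())}


# Invented data: path length 0.0 is the root, 1.0 is near the tips.
toy = [(0.0, 40.0), (0.0, 60.0), (0.5, 35.0), (0.5, 55.0), (1.0, 38.0), (1.0, 62.0)]
summary = summarise_by_distance(toy)
for distance, (lo, avg, hi) in summary.items():
    print(distance, lo, avg, hi)
```

If rate of evolution confounded the reconstruction, the mean and range would be expected to fall as path length increases; in this toy input they do not, mirroring the pattern the text reports.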

Another possibility is that these results are dependent on

the tree. The topology of the tree in Fig. 3 is conventional in

sharing many similarities to published phylogenies of the

prokaryotes. Cavalier-Smith has suggested that the eukar-

yotes and archaea should be placed on the actinobacterial

branch.(35) When we perform our ancestral reconstructions on

this tree, the same patterns of gradual accumulation still

emerge.

Conclusions

We find theoretical and empirical support for the notion that

successive events of lateral gene transfer over time, without

recourse to fusion or endosymbionts, can explain the broad

outlines of the eukaryotic mosaic. The presence of the

mitochondrion attests to endosymbiosis having occurred in

the eukaryotic lineage, and instances of transfer of genes from

the mitochondrion to the nucleus are well documented,(16,36)

but we do not find empirical evidence compelling us towards an

interpretation that relies upon such transfers to explain the

presence of the eubacterial fraction of genes in eukaryotes.

While it is true that the vast majority of the genes that code

for the mitochondrial proteome are found within the nu-

cleus,(37,38) comparatively few can be traced back to the

putative mitochondrial ancestor.(39,40) We compared a recent

list of the 750 genes that constitute the mitochondrial

proteome(38) to the 850 (nuclear) genes in the Esser et al.

dataset. Only 62 of these proteins had a homologue within the

prokaryotes, and, of those, only 51 had a homologue amongst

the eubacteria. The remaining genes in the mitochondrial

proteome are evidently eukaryotic inventions: most of the

endosymbiont’s genes have simply been lost.(17,21,40) This

may not be surprising. The transfer of genes from the

mitochondrion to the nucleus requires a sophisticated set of

controlling proteins and for each stage to retain its function-

ality.(16,39) The loss of genes from the mitochondrion might be

better understood as similar to the reductive evolution that

occurs in the genes of obligate parasites.(41)
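
The proportions implied by these counts can be worked through directly. The short sketch below assumes that the 750-gene mitochondrial proteome is the denominator for both counts, which the text implies but does not state outright; the variable names are illustrative only.

```python
# Counts quoted in the text; treating 750 as the denominator is an assumption.
MITO_PROTEOME = 750
WITH_PROKARYOTIC_HOMOLOGUE = 62
WITH_EUBACTERIAL_HOMOLOGUE = 51

frac_prok = WITH_PROKARYOTIC_HOMOLOGUE / MITO_PROTEOME
frac_eub = WITH_EUBACTERIAL_HOMOLOGUE / MITO_PROTEOME
without_homologue = MITO_PROTEOME - WITH_PROKARYOTIC_HOMOLOGUE

print(f"prokaryotic homologue: {frac_prok:.1%}")
print(f"eubacterial homologue: {frac_eub:.1%}")
print(f"no prokaryotic homologue: {without_homologue} genes")
```

On this reading, roughly 8% of the proteome has any prokaryotic homologue and roughly 7% a eubacterial one, leaving 688 genes without a detected prokaryotic counterpart.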

The difference between the prokaryotes and eukaryotes in

terms of cell structure, genetics and even molecular biology, is

so great that it was long seen as the central divide in

biology.(42,43) There are no obvious transitional forms between

the two(44) and the gulf can at times seem insuperable,

demanding of some saltational event. Indeed, the sequential

evolution of prokaryotes to eukaryotes has long been

doubted,(45) even though no biologist now questions that all

life is related.(46) Such doubts were not misplaced: eukaryotes

are not a product of sequential evolution in the conventional

Darwinian sense; they are almost certainly a product of the

prokaryotic domains.(19,47) The absence of contemporary

mechanisms of lateral gene transfer in eukaryotes is some-

times taken as evidence that the eukaryotic hybrid could not

have arisen from events of this kind. But yeast are a derived

lineage and what matters is whether there could have been

lateral gene transfer to the progenitor of all eukaryotes.

This supposition seems reasonable: lateral gene transfer

occurs in prokaryotes; phagocytosis, on the other hand,

seems not to.(48)

Figure 4. The plot records the average sequence similarity

between genes in eubacteria and their yeast homologue for

each of the 665 genes in the Esser et al.(19) data, which

have been identified as candidates for lateral transfer from

the eubacteria to the eukaryotes. These are plotted against

the distance (path length) from the root of the tree to the node

where the gene is reconstructed to have first appeared among

the eubacteria. Although there is a trend for older genes

(reconstructed as having arisen nearer the root) to show higher

sequence similarity, this is an artefact caused by the second

group of genes from the left. Even genes reconstructed near

the tip of the tree can have high sequence similarity. Where a

gene is reconstructed to arise appears not to be confounded by

its rate of evolution.

Understanding ‘eukaryote’ evolution partly rests on the

question of what is meant by their origin. Is endosymbiosis

the sine qua non of eukaryotes,(49) or do eukaryotes predate

the first endosymbiosis event?(42,50) Most theories that

attempt to describe the evolution of the eukaryotes rest on

the former supposition: some form of fusion is required

between prokaryotic cells, and it is the resultant symbiosis that

forms the eukaryotic lineage.(13,18) Even though there are no

extant primitively amitochondriate species,(51) such organelles

cannot be seen as the diagnostic trait of eukaryotes, for many

extant taxa have lost their mitochondria.(50,52) Whilst endo-

symbiosis may have occurred very soon after the origin of

eukaryotes, it need not be seen as the process that formed

them.(50)

Modern ‘eukaryoteness’, then, is more than just a peculiar

genome and the presence of organelles. It is a set of traits that

is distinctive in its combination(48) and likely emerged not in a

singular event, but through the pulling together of many

different threads that only became available through time.

Individually, many of the genes that make the eukaryotic

mosaic distinctive are found in disparate taxa, spread across

the prokaryote phylogeny.(53–56) What requires a special

explanation is not the presence of those genes, but why it

was the eukaryote lineage that seemingly invaded, or

perhaps invented, a new ecological niche in which a mixed

complement of archaeal, eubacterial and, eventually, many

new eukaryotic genes and structures would be required.

Recent evidence for an ancient origin of the unique and

sophisticated eukaryotic spliceosome(57) opens one promising

avenue of research into this most unusual lineage.

References

1. Martin W, Embley TM. 2004. Early evolution comes full circle. Nature

431:134–136.

2. Woese CR, Fox GE. 1977. Phylogenetic structure of the prokaryotic

domain: the primary kingdoms. Proc Natl Acad Sci USA 74:5088–5090.

3. Brown JR, Doolittle WF. 1995. Root of the universal tree of life based on

ancient aminoacyl-tRNA synthetase gene duplications. Proc Natl Acad

Sci USA 92:2441–2445.

4. Brown JR, Douady CJ, Italia MJ, Marshall WE, Stanhope MJ. 2001.

Universal trees based on large combined protein sequence data sets.

Nat Genet 28:281–285.

5. Daubin V, Gouy M, Perriere G. 2002. A phylogenomic approach to

bacterial phylogeny: evidence of a core of genes sharing a common

history. Genome Res 12:1080–1090.

6. Iwabe N, Kuma K, Hasegawa M, Osawa S, Miyata T. 1989. Evolutionary

relationship of archaebacteria, eubacteria and eukaryotes inferred from

phylogenetic trees of duplicated genes. Proc Natl Acad Sci USA 86:

9355–9359.

7. Gribaldo S, Cammarano P. 1998. The root of the universal tree of life

inferred from anciently duplicated genes encoding components of the

protein-targeting machinery. J Mol Evol 47:508–516.

8. Ribeiro S, Golding GB. 1998. The mosaic nature of the eukaryotic

nucleus. Mol Biol Evol 15:779–788.

9. Rivera MC, Jain R, Moore JE, Lake JA. 1998. Genomic evidence for two

functionally distinct gene classes. Proc Natl Acad Sci USA 95:6239–

6244.

10. Gupta R. 1997. Protein phylogenies and signature sequences: evolu-

tionary relationships within prokaryotes and between prokaryotes and

eukaryotes. Antonie van Leeuwenhoek 72:49–61.

11. Martin W. 1999. Mosaic bacterial chromosomes: a challenge en route to

a tree of genomes. BioEssays 21:99–104.

12. Katz LA. 2002. Lateral gene transfers and the evolution of eukaryotes:

theories and data. Int J Syst Evol Microbiol 52:1893–1900.

13. Martin W, Muller M. 1998. The hydrogen hypothesis for the first

eukaryote. Nature 392:37–41.

14. Lopez-Garcia P, Moreira D. 1999. Metabolic symbiosis at the origin of

eukaryotes. Trends Biochem Sci 24:88–93.

15. Margulis L. 1970. Origin of Eukaryotic Cells. New Haven: Yale University

Press.

16. Martin W, Herrmann RG. 1998. Gene transfer from organelles to the

nucleus: how much, what happens, and why? Plant Physiol 118:9–17.

17. Andersson SGE, Kurland CG. 1999. Origins of mitochondria and

hydrogenosomes. Curr Opin Microbiol 2:535–541.

18. Moreira D, Lopez-Garcia P. 1998. Symbiosis between methanogenic

archaea and δ-proteobacteria as the origin of eukaryotes: the syntrophic

hypothesis. J Mol Evol 47:517–530.

19. Esser C, Ahmadinejad N, Wiegand C, Rotte C, Sebastiani F, et al. 2004.

A genome phylogeny for mitochondria among α-Proteobacteria and a

predominantly eubacterial ancestry of yeast nuclear genes. Mol Biol Evol

21:1643–1660.

20. Doolittle WF. 1998. You are what you eat: a gene transfer ratchet could

account for bacterial genes in eukaryotic nuclear genomes. Trends

Genet 14:307–311.

21. Karlberg O, Canback B, Kurland CG, Andersson GE. 2000. The dual

origin of the yeast mitochondrial proteome. Yeast 17:170–187.

22. Bapteste E, Gribaldo S. 2003. The genome reduction hypothesis and the

phylogeny of eukaryotes. Trends Genet 19:696–700.

23. Rivera MC, Lake JA. 2004. The ring of life provides evidence for a

genome fusion origin of eukaryotes. Nature 431:152–155.

24. Lake JA, Rivera MC. 2004. Deriving the genomic tree of life in the

presence of horizontal gene transfer: conditioned reconstruction. Mol

Biol Evol 21:681–690.

25. Lawrence J, Hendrickson H. 2003. Lateral gene transfer: when will

adolescence end? Mol Microbiol 50:739–749.

26. Brown J. 2001. Genomic and phylogenetic perspectives on the evolution

of prokaryotes. Syst Biol 50:497–512.

27. Garcia-Vallve S, Romeu A, Palau J. 2000. Horizontal gene transfer

in Bacterial and Archaeal complete genomes. Genome Res 10:1719–

1725.

28. Tatusov RL, Galperin MY, Natale DA, Koonin EV. 2000. The COG

database: a tool for genome-scale analysis of protein functions and

evolution. Nucleic Acids Res 28:33–36.

29. Pagel M, Meade A, Barker D. 2004. Bayesian Estimation of Ancestral

Character States on Phylogenies. Syst Biol 53:673–684.

30. Ochman H, Lawrence J, Groisman EA. 2000. Lateral gene transfer and

the nature of bacterial innovation. Nature 405:299–304.

31. Pagel M. 1999. The maximum likelihood approach to reconstructing

ancestral character states of discrete characters on phylogenies. Syst

Biol 48:612–622.

32. Brown JR. 2003. Ancient horizontal gene transfer. Nat Rev Genet 4:

121–132.

33. Doolittle WF. 1999. Phylogenetic classification and the universal tree.

Science 284:2124–2128.

34. Javaux EJ, Knoll AH, Walter M. 2003. Recognising and interpreting the

fossils of early eukaryotes. Origins Life Evol B 33:75–94.

35. Cavalier-Smith T. 2002. The phagotrophic origin of eukaryotes and

phylogenetic classification of Protozoa. Int J Syst Evol Micr 52:297–354.

36. Gray MW, Lang BF, Cedergren R, Golding GB, Lemieux C, et al. 1998.

Genome structure and gene content in protist mitochondrial DNAs.

Nucleic Acids Res 26:865–878.

37. Prokisch H, Scharfe C, Camp DG II, Xiao W, David L, et al. 2004.

Integrative analysis of the mitochondrial proteome in yeast. PLoS Biol

2:795–804.


38. Sickmann A, Reinders J, Wagner Y, Joppich C, Zahedi R, et al. 2003.

The proteome of Saccharomyces cerevisiae mitochondria. Proc Natl

Acad Sci USA 100:13207–13212.

39. Gabaldon T, Huynen M. 2004. Shaping the mitochondrial proteome.

Biochim Biophys Acta 1659:212–220.

40. Gray MW, Burger G, Lang BF. 2001. The origin and early evolution of

mitochondria. Genome Biol 2:1018.1–1018.5.

41. Andersson SGE, Kurland CG. 1998. Reductive evolution of resident

genomes. Trends Microbiol 6:263–268.

42. Cavalier-Smith T. 1987. The origin of eukaryote and archaebacterial

cells. Ann N Y Acad Sci 503:17–54.

43. Mayr E. 1998. Two empires or three? Proc Natl Acad Sci USA 95:9720–

9723.

44. Doolittle WF. 1998. A paradigm gets shifty. Nature 392:15–16.

45. Darnell JE Jr. 1978. Implications of RNA-RNA splicing in evolution of

eukaryotic cells. Science 202:1257–1260.

46. Doolittle RF. 2000. Searching for the common ancestor. Res Microbiol

151:85–89.

47. Horiike T, Hamada K, Miyata D, Shinozawa T. 2004. The origin of

eukaryotes is suggested as the symbiosis of Pyrococcus into

γ-Proteobacteria by phylogenetic tree based on gene content. J Mol Evol

59:606–619.

48. Vellai T, Vida G. 1999. The origin of eukaryotes: the difference between

prokaryotic and eukaryotic cells. Proc Roy Soc Lond Ser B 266:1571–

1577.

49. Doolittle WF. 1999. Rethinking the origin of eukaryotes. Biol Bull

196:378–380.

50. Embley TM, Hirt RP. 1998. Early branching eukaryotes? Curr Opin Genet

Dev 8:624–629.

51. Cavalier-Smith T. 2002. Origins of the machinery of recombination and

sex. Heredity 88:125–141.

52. Williams BAP, Hirt RP, Lucocq JM, Embley TM. 2002. A mitochondrial

remnant in the microsporidian Trachipleistophora hominis. Nature

418:865–869.

53. Searcy DG, Hixon WG. 1991. Cytoskeletal origins in sulfur-metabolising

archaebacteria. BioSyst 25:1–11.

54. Sioud M, Baldacci G, Forterre P, Recondo A. 1987. Antitumour

drugs inhibit the growth of halophilic archaea. Eur J Biochem 169:

231–236.

55. Lowe J, van den Ent F, Amos LA. 2004. Molecules of the bacterial

cytoskeleton. Annu Rev Biophys Biomol Struct 33:177–198.

56. Ferat J, Michel F. 1993. Group II self-splicing introns in bacteria. Nature

364:358–361.

57. Collins L, Penny D. 2005. Complex spliceosomal organization ancestral

to extant eukaryotes. Mol Biol Evol 22:1053–1066.

58. Battistuzzi FU, Feijao A, Hedges SB. 2004. A genomic timescale of

prokaryote evolution: insights into the origin of methanogenesis,

phototrophy, and the colonization of land. BMC Evol Biol 4:44.

59. Wolf YI, Rogozin IB, Grishin NV, Koonin EV. 2002. Genome trees and the

tree of life. Trends Genet 18:472–479.

60. Gupta R, Griffiths E. 2002. Critical issues in bacterial phylogeny. Theor

Popul Biol 61:423–434.

61. Qi J, Wang B, Hao B. 2004. Whole proteome prokaryote phylogeny

without sequence alignment: a K-string composition approach. J Mol

Evol 58:1–11.

62. Dutilh BE, Huynen M, Bruno WJ, Snel B. 2004. The consistent

phylogenetic signal in genome trees revealed by reducing the impact

of noise. J Mol Evol 58:527–539.
