phylogeny and evolution of structure -...

66
Phylogeny and evolution of structure Tanja Gesell and Peter Schuster Abstract Darwin’s conviction that all living beings on Earth are related and the graph of relat- edness is tree-shaped has been essentially confirmed by phylogenetic reconstruction first from morphology and later from data obtained by molecular sequencing. Limi- tations of the phylogenetic tree concept were recognized as more and more sequence information became available. The other path-breaking idea of Darwin, natural se- lection of fitter variants in populations, is cast into simple mathematical form and extended to mutation-selection dynamics. In this form the theory is directly appli- cable to RNA evolution in vitro and to virus evolution. Phylogeny and population dynamics of RNA provide complementary insights into evolution and the interplay between the two concepts will be pursued throughout this chapter. The two strate- gies for understanding evolution are ultimately related through the central paradigm of structural biology: sequence structure function. We elaborate on the state of the art in modeling both phylogeny and evolution of RNA driven by reproduction and mutation. Thereby the focus will be laid on models for phylogenetic sequence evolution as well as evolution and design of RNA structures with selected examples and notes on simulation methods. In the perspectives an attempt is made to combine molecular structure, population dynamics and phylogeny in modeling evolution. Key words: Evolution of structure; multiple structures; phylogeny; quasispecies concept; sequence-structure mappings; sequence evolution; simulations. Tanja Gesell Center for Integrative Bioinformatics Vienna (CIBIV), Max F. Perutz Laboratories (MFPL), Uni- versity of Vienna, Medical University of Vienna and University of Veterinary Medicine, Dr. Bohr- Gasse 9, Vienna, 1030, Austria. School of Engineering and Applied Sciences, Harvard University, 33 Oxford Street, Cambridge, MA 02138, USA e-mail: [email protected] Peter Schuster Institute f¨ ur Theoretische Chemie der Universit¨ at Wien, W¨ ahringerstrasse 17, A-1090 Vienna, and The Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA e-mail: [email protected] 1

Upload: lamthuan

Post on 28-Mar-2018

216 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure

Tanja Gesell and Peter Schuster

Abstract

Darwin’s conviction that all living beings on Earth are related and the graph of relat-edness is tree-shaped has been essentially confirmed by phylogenetic reconstructionfirst from morphology and later from data obtained by molecular sequencing. Limi-tations of the phylogenetic tree concept were recognized asmore and more sequenceinformation became available. The other path-breaking idea of Darwin, natural se-lection of fitter variants in populations, is cast into simple mathematical form andextended to mutation-selection dynamics. In this form the theory is directly appli-cable to RNA evolutionin vitro and to virus evolution. Phylogeny and populationdynamics of RNA provide complementary insights into evolution and the interplaybetween the two concepts will be pursued throughout this chapter. The two strate-gies for understanding evolution are ultimately related through the central paradigmof structural biology: sequence⇒structure⇒ function. We elaborate on the state ofthe art in modeling both phylogeny and evolution of RNA driven by reproductionand mutation. Thereby the focus will be laid on models for phylogenetic sequenceevolution as well as evolution and design of RNA structures with selected examplesand notes on simulation methods. In the perspectives an attempt is made to combinemolecular structure, population dynamics and phylogeny inmodeling evolution.

Key words: Evolution of structure; multiple structures; phylogeny; quasispeciesconcept; sequence-structure mappings; sequence evolution; simulations.

Tanja GesellCenter for Integrative Bioinformatics Vienna (CIBIV), MaxF. Perutz Laboratories (MFPL), Uni-versity of Vienna, Medical University of Vienna and University of Veterinary Medicine, Dr. Bohr-Gasse 9, Vienna, 1030, Austria. School of Engineering and Applied Sciences, Harvard University,33 Oxford Street, Cambridge, MA 02138, USA e-mail: [email protected]

Peter SchusterInstitute fur Theoretische Chemie der Universitat Wien,Wahringerstrasse 17, A-1090 Vienna, andThe Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USAe-mail: [email protected]

1

Page 2: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

2 Tanja Gesell and Peter Schuster

1 Evolutionary thinking in mathematical language

James Watson, one of the researchers who discovered the structure of the DNAdouble helix, begins his text bookMolecular Biology of the Genein 1965 with aneuphoric introduction to evolution to underline its general importance. Today, al-most fifty years after the discovery of the double helix and more than 200 yearsafter Charles Darwin’s birthday in 1809, evolution is stillat the center of biologicalthinking: the statement of geneticist Theodosius Dobzhansky [33], “Nothing in bi-ology makes sense except in the light of evolution”, is frequently cited and yet, PaulGriffiths [75] has given voice to the widespread feeling thatan evolutionary perspec-tive is indeed necessary, but it must be aforward-lookingperspective allowing fora general understanding of the evolutionary process, not only a backward-lookingperspectivedealing with the specific evolutionary history of the species.

Here, we shall review current mathematical models of evolution while focusingin particular on two mechanistic aspects of evolution, phylogeny and populationdynamics with respect to structure evolution, trying to work out how they are re-lated. The first section introduces the concept of the phylogenetic tree and continueswith an attempt to translate Charles Darwin’s thoughts on natural selection intomathematical language with the knowledge of his contemporaries in mathematics.Section 2 elaborates on the state of the art in modeling phylogeny and continueswith a review of evolutionary dynamics based on reproduction and mutation. Sec-tion 3 then deals with the translation of results from theoryto applications in order todemonstrate the practical usefulness of evolutionary models. The notes (section 4)will be concentrating on computation and simulation methods of phylogenetic as-pects of the evolutionary process. In the prospects section5 we shall discuss theperspectives afforded by attempts to combine molecular structure and phylogeny inmodeling evolution, and mention future developments in theexperimental determi-nation of fitness landscapes in the sense of Sewall Wright’s metaphor [177].

Fig. 1 An evolutionary tree by Charles Dar-win. The ancestral species is at position ‘1’. Ex-tant species are denoted by endpoint and letters,and the remaining pendant edges represent ex-tinctions. On the margin of his sketch of a treeDarwin had written, ‘I think’, before expand-ing his idea inThe Origin of Species[30]: ‘Theaffinities of all the beings of the same class havesometimes been represented by a great tree. Ibelieve this simile largely speaks the truth. Thegreen and budding twigs may represent exist-ing species; and those produced during each for-mer year may represent the long succession ofextinct species...’ (First Notebook on Transmu-tation of Species, 1837, courtesy of CambridgeUniversity Library).

Page 3: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 3

Fig. 2 Sequence, structure, andfunction in structural biology.The relation between sequence,structure, and function is modeledby two successive mappings,Φfrom sequences into structures andΨ from structures into real num-bers being a quantitative measureof function.

S Y S= F( ) f = Y( )Y

F Þ: ( , ) ( , )Q YdH dY Y Þ: ( , )Y dY R1

sequence structure function

Darwin’s centennial bookThe Origin of Species[30] appeared in November 1859and represents a masterpiece of abstraction and reduction,since the principle ofevolution built upon only three factors – multiplication, variation, and (natural) se-lection in size-constrained populations – had been extracted from an overwhelmingwealth of observations, but contains not a single mathematical expression and onlyone graphic illustration representing atree of life. Fig.1 shows Darwin’s convictionthat all life on earth is related and that the pattern of relatedness is shaped like atree. Darwin had drawn this sketch already inNotebook B, more than twenty yearsearlier, adding the words ‘I think’ on its side. Darwin’s fundamental concept of thetree of descendance of species is introduced by illustrating the present-day conceptof phylogenetic trees. How might Charles Darwin have formulated his theory hadhe been a mathematician? We do not know, but we shall try to cast Darwin’s naturalselection in simple equations that were known at his time.

Phylogeny and population dynamics provide complementary insights into evo-lution and the interplay of the two concepts will be pursued throughout this chap-ter. The two strategies for understanding evolution are ultimately related throughthe central paradigm of structural biology (Fig.2): sequencesS = (s1s2 . . .sl ) foldinto structuresY and the structures encode the functions of molecules or organ-isms as expressed, for example, by fitness values. Both relations can be viewed asmappings, the first one from sequence space1 Q into structure spaceY , and thesecond one from structure space into the real numbers,Φ : (Q,dH) → (Y ,dY)andΨ : (Y ,dY) → R

1, respectively. Sequence space and structure space are metricspaces with the Hamming distancedH and the structure distancedY as metrics. Inevolution, phylogeny and selection are related by these mappings from sequenceinto function, in particular into fitness valuesf . Both mappings are many to one andhence cannot be inverted in strict mathematical sense. It isnevertheless possible toconstruct pre-images of given structures in sequence spaceas well as pre-images ofgiven functions in structure space. Their inversions then become important tools inthe design of biomolecules – as we shall discuss in section 2.3.

1 Sequences are ordered strings of elementssk (k = 1, . . ., l ), which in case of RNA are chosenfrom the alphabetA = {A,U,C,G}. The notions of sequence space and structure or shape spaceare essential for the definition of structure and function asresults of mappings.

Page 4: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

4 Tanja Gesell and Peter Schuster

1.1 The phylogenetic tree

In present-day phylogenetic research – following Darwin’sideas (Fig.1) – we as-sume thatn sequencesSn are related by an (unrooted or rooted) treeT whoseleaves represent the aligned sequences. The treeT = (V,E) consists of a vertexsetV and a branch setE ∈ V ×V [142], where the lengths of the branches ofT

are a measure of the extent of evolutionary change. Above all, in this book we areinterested in the change nucleotide patterns of RNA molecules along a phylogenetictree. The vertex setV contains the taxon setS, which maps one-to-one onto the leafset.

Definition 1. A phylogenetic treeT is a connected, undirected, acyclic graphwhose leaves are labeled bijectively by the taxon setS.

1. An unrooted phylogenetic treeT has no vertices of degree two.2. A rooted phylogenetic treeT has an internal vertex, which may have degree two

and which forms theroot of the tree.3. A star tree is a phylogenetic treeT with one internal vertex, which may have a

cardinality degree of the taxon setSor in other words, all taxa have one commonancestor.

The tree-lengthΛT is the sum of the branch lengths.

ΛT = ∑e∈E

λe (1)

whereλe > 0 represents thelengthof a branche∈ E. In commonly used phyloge-netic methods the branch length is measured in numbers of substitutions per site.Thisgenetic distance destimated from nucleotide sequences are generally based onmodels of sequence evolution – as we shall discuss in section2.1.1. By contrast, the

e3

e4

e5

e7

e6

S1

S2

S3

S5

S4

e1

e2

5

S2

S3

S1

S2

S4

S5

e1

e2

e4

e5

e6

e3

e7

e8

Fig. 3 Phylogenetic trees showing relatedness of sequences(Sn). The numbersei symbolizethe lengths of the branches measured in Hamming distancedH. The distance between any pair ofsequences can be computed by adding up the lengths of the connecting branches. Lhs: Unrootedtree of five sequences (S1 to S5): this tree does not contain a node corresponding to the ancestorof all five sequences. Rhs: Rooted tree.

Page 5: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 5

Hamming distance of two sequences,dH(Si ,S j), is the smallest number of substi-tutions, extensions, and deletions required to convert thetwo sequences into eachother.

Charles Darwin’s great foresight concerning the existenceof a tree of lifehasbeen confirmed by phylogenetic reconstruction of the evolution of multicellular eu-karyotes by making extensive use of data from molecular sequencing [72]. The ideaof a single root of the tree of life and a common universal ancestor, however, turnedout to be much less clear. Although only few doubts were raised regarding the sin-gle, non-recurring origin of life on Earth, the nature of theprimordial prokaryoteremained in the realm of speculation [37, 174], and for a longtime it remainedundecided whether such an ancestor had been a single speciesor a clan of geneti-cally strongly interacting clones. As more and more sequence information becameavailable on excessive horizontal gene transfer (see [16] and section 1.2) amongearly prokaryotic organisms, it became clear that the tree concept becomes moreand more obscure the further one approaches the distant past[38, 39]. The currentreconstruction of the early history of life on Earth by a plethora of genomic datais not supportive of a single tree of life for archaebacterial, eubacterial, and primi-tive eukaryotic species but seems to be converging towards ascenario with multiplespecies rapidly exchanging genetic information [6, 115, 146].

1.2 Limitations of the phylogenetic tree concept

Given the massive amount of sequence data observed today, a general goal in phy-logeny is to reconstruct the evolutionary history of patterns of contemporary organ-isms, typically in the form of a phylogenetic tree. Accumulations of mutations by thecopying process and environmental factors manifest the changes in the sequencescalledsubstitutions. Thus, given the vertical transmissions in time, the discrete char-acter of mutations and a well-defined alphabet, the biopolymer sequences representa unique memory of the phylogenetic past. The sequences may either come fromDNA, protein, RNA or other character-based molecules with linear arrangements ofmonomers from several classes. Misinterpretations of data, however, are possible,e.g. through time-heterogenous processes. Moreover, present day sequence-basedphylogenies of organisms are based on many different genes,which can lead tocontroversial interpretations of evolutionary relationships between the organisms.Evolution of genes is different from species evolution, andthey should not be con-fused. Indeed, there are various reasons why phylogenetic inference is not alwaysstraightforward and potentially leads to misconceptions (see Figs. 4), examples ofproblem sources are random sorting of ancestral polymorphism [132], horizontalgene transfer [36, 100], and gene duplication or gene loss [123]. As mentioned be-fore, horizontal gene transfer obscures phylogeny and the question arises of howto define species trees for a set of prokaryotic taxa and subsequently, of how sucha species tree can be inferred? In current research, phylogenetic networks offer analternative to phylogenetic trees (A good introduction to this problematic is given

Page 6: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

6 Tanja Gesell and Peter Schuster

by [83]). In addition, the fields of phylogenetic analysis and population geneticswill come closer together in the near future as complete genomic sequences forlarge numbers of individuals, strains and species will become available thanks toadvanced sequencing technologies.

However, the reader should note that the application of phylogenetic methodsgoes beyond the reconstruction of phylogenetic trees for organisms. The phyloge-netic analysis of molecular sequences using phylogenetic trees is an establishedfield and several books (e.g., [55, 142]) offer detailed descriptions of the differentapproaches. In the present chapter, we shall present some applications of phyloge-

a

5

S2

S3S2 S5S1

e1 e

2

e3

e4

e5

e8

e7

e6

t

S4

b c

X

Fig. 4 Limitations in phylogeny. L.h.s: trees for individual characters (inner tree) can differ fromthe species tree (outer gray tree). On the left branches, random sorting of ancestral polymorphism atsubsequent speciation events. On the right branches horizontal gene transfer, i.e. the lateral transferof individual genes or sequences between species. R.h.s: a duplication event on the left branchesand an instance of gene loss (x) on the right branch, both of which may lead to misconceptions ofa species tree.

Page 7: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 7

netic substitution models in the context of RNA research andcomparative genomicsin section 3. Just as phylogenetic analysis, population genetics is a mature disciplinethat has been addressed in a number of books, including [77, 79]. Both phylogenyand population genetics remain active and continually evolving areas of research.In an attempt to interconnect the two disciplines for futureRNA research, we shallfocus on the question of how phylogeny and population genetics are related to struc-ture evolution on the pages to come.

1.3 The mathematics of Darwin’s selection in populations

In 1838 the Belgium mathematician Pierre-Francois Verhulst published a kineticdifferential equation, which presumably was not known to Darwin. The Verhulst orlogistic equation describes population growth in ecosystems with finite resources:

dNdt

= N · r(

1− NK

)

and N(t) = N(0)K

N(0)+(

K−N(0))

e−r t. (2)

Fig. 5 Solution curves ofthe logistic equations(2,2’).Upper plot: The black curveillustrates growth in popula-tion size from a single in-dividual to a population atthe carrying capacity of theecosystem. The red curverepresents the results for un-limited exponential growth,N(t) = N(0)exp(rt ).Parameters:r = 2, N(0) = 1,andK = 10000.Lower plot: Growth and in-ternal selection is illustratedin a population with fourvariants. Color code:N black,N1 yellow, N2 green,N3 red,N4 blue.Parameters: fitness valuesf j = (1.75,2.25,2.35,2.80),Nj (0) = (0.8888,0.0888,0.0020,0.0004), K = 10000.The parameters were ad-justed such that the curvesfor the total populations sizeN(t) coincide (almost) inboth plots.

Page 8: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

8 Tanja Gesell and Peter Schuster

The number of individualsI in the population or the population size at timet in de-noted by[I ]t = N(t), N(0) is the population size at timet = 0,r is the growth parame-ter orMalthusianparameter named after the English economist Robert Malthus, andK is thecarrying capacity, the maximal population size that can be sustained by theecosystem. The interpretation is straightforward: Populations grow by reproductionand the population size increases proportionally to the number of individuals al-ready present times the Malthusian parameter –N(t) · r, growth requires resourcesand this is taken into account by the third factor –

(

1−N(t)/K)

, which approacheszero when the population size reaches the carrying capacityand no further growth ispossible. The solution of Verhulst equation is called thelogistic curve. In the earlyphase of growth,N(t) � K (Fig.5; upper plot, red curve), the logistic curve rep-resents unlimited exponential growth, the effect of finite resources is significant atpopulations sizes in the range of 20 % saturation,N(t) = 0.2K, and larger.

The Verhulst model of constrained growth is dealing withmultiplication, the firstfactor of Darwin’s principle.Selectionfollows straightforwardly from a partitioninginto n subpopulations [139],I j ( j = 1, . . . ,n) with [I j ] = Nj andN(t) = ∑n

j=1Nj(t).Each variant has a specific growth parameter orfitness valuedenoted byf j ; j =1, . . . ,n and the result is theselection equation(3) describing the evolution of thepopulation:

Fig. 6 Solution curve ofthe selection equation (3).The system is studied atconstant maximal populationsize, N = K, and the plotsrepresent calculated changesof the variant distributionswith time. The upper plotshows selection amongthree speciesI1 (yellow), I2(green), and I3 (red), andthen the appearance of afourth, fitter variantI4 (blue)at time t = 6, which takesover and becomes selectedthereafter. The lower plotpresents an enlargement ofthe upper plot around thepoint of spontaneous creationof the fourth species (I4).Parameters: fitness valuesf j = (1,2,3,7);x j (0) = (0.9,0.08,0.02,0)andx4(6) = 0.0001.

Page 9: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 9

dNj

dt= Nj

(

f j −NK

φ(t)

)

; j = 1, . . . , n with φ(t) =1

N(t)

n

∑i=1

fi Ni(t) . (3)

Equation (3) is solved by means of normalized variablesx j(t) = Nj(t)/N(t):

dxj

dt= x j

(

f j − φ(t))

, j = 1, . . . , n; x j (t) =x j (0) ·exp( f j t)

∑ni=1x j (0) ·exp( fi t)

, j = 1, . . . , n . (3a)

The size of the subpopulations is obtained through multiplication by the total popu-lation size

Nj (t) = N(t) ·x j(t) = N(t) ·N(0) ·exp( f j t)

∑ni=1Nj(0) ·exp( fi t)

; j = 1, . . . , n , (3b)

and hence, the knowledge of the time dependence of population size is required. Itis obtained by means of the integralΦ(t) =

∫ t0 φ(τ)dτ

N(t) = N(0)K

N(0)+(

K−N(0))

e−Φ(t). (2a)

Herein exp(

−Φ(t))

replaces exp(−rt ) in the solution of the Verhulst equation. Thecourse of selection in the variablesNj andx j is essentially the same and the restric-tion to constant population size,N = K, was done only in order to simplify the anal-ysis. Typical solution curves are shown in Fig.6. The interpretation of the solutioncurves (3b) is straightforward: For sufficiently long time the sum in the denominatoris dominated by the term containing the exponential with thehighest fitness value,fm = max{ f j ; j = 1, . . . ,n}, the consequence is that all variables exceptxm vanishand the fittest variant is selected: limt→∞[Xm] = N. Finally, the time derivative of themean fitness,dφ(t)/dt = var{ f} ≥ 0, encapsulates optimization in Darwinian evo-lution: φ(t) is optimized during natural selection (see, e.g., [139]). The expressionsfor selection are rather simple, derivation and analysis are both straightforward, andeverything needed was standard mathematics at Darwin’s time.

Mechanisms creating new variants are not part of the nineteenth century modelof evolution. Recombination and mutation were unknown and new variants appearspontaneously in the population like thedeus ex machinain the ancient antiquetheater. Fig.6 illustrates selection at constant population size and the growth of aspontaneously created advantageous variant. Eventually the simple mathematicalmodel described here encapsulates all three preconditionsof Darwin’s natural se-lection and reflects the state of knowledge of the evolutionists in the second half ofnineteenth century.

Page 10: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

10 Tanja Gesell and Peter Schuster

2 Mathematical models

Historically, the first synthesis of Darwin’s theory and Mendelian genetics was per-formed through mathematical modeling of evolution by the three scholars of pop-ulation genetics, Ronald Fisher, J.B.S. Haldane, and Sewall Wright. Modeling ofevolution and many other topics in biology was and still is mainly based on ordi-nary differential equations (ODEs) for essentially two reasons: (i) ODEs are at leastmoderately well suited for handling the problems and (ii) handling and analyzingODEs is based on 350 years experience from physics and mathematics. Partial dif-ferential equations (PDEs) are used for the description of migration and spreadingof populations in space, in particular in ecological and epidemiological models, andfor modeling morphogenesis. Apart from ODE and PDE models difference equa-tions are used to describe discrete generations, and stochastic modeling by meansof Markov processes is applied to analyze problems in biology. Markov models areparticularly important in phylogeny.

Classical phylogeny was based on morphological comparisons which, althoughsuccessful, escaped quantitative analysis and handling. The advent of molecular bi-ology and the fast growing facilities for determination of biopolymer sequences andstructures changed the scene entirely. Sequencing was developed first for proteinsand later for nucleic acids, it laid down the basis for sequence comparisons andinitiated the field of molecular evolution. Since twenty years a true explosion ofavailable DNA sequences is taking place and the dramatically increasing demandsfor electronic storage facilities and retrieval as well as the requirements for new toolsfor data analysis are challenging computer science and led to the development of thenovel discipline of bioinformatics. The enormous increasein computing power anddata storage capacities, eventually, revolutionized alsobiological modeling and sim-ulation, computational biology became a field of research inits own right. Discretemathematics, in particular combinatorics and graph theoryare now indispensablefor biological modeling not only in molecular phylogeny butalso for nucleic acidstructure predictions and several other fields.

A

A

C

G

U

A

U

U

C

C

C

C

A

A

G

G

U

U

C

A

Sequence 1

Sequence 2

d

Fig. 7 Sequence evolution. The evolution of sequences is described by substitutionalchangesof single positions during a certain time span∆t, measured in number of substitution per sited.For independent-sites models we assume that positions evolve independently (as indicated by theframed boxes) and according to the same process (see also theclosely related uniform error ratemodel in section 2.3.4).

Page 11: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 11

Fig. 8 Standard models. Independent evolution ofsites, the Markov property, and continuous time areassumed, and the|A |× |A | instantaneous rate matrixis defined byQ = {(Q)i j ; i, j = 1, . . . , |A |}, and ma-trix, where|A | = κ is the number of character states.This chapter will mainly consider RNA evolution withA = {A,C,G,U}, hence|A |= κ = 4. For binary andnucleotide sequences, amino acid sequences in pro-teins and codon sequences in genes, the alphabet sizeis 2, 4, 20 and 64, respectively.

A

U

C

G

2.1 Phylogenetic models

Sequence structuringduring a time span∆ t is described by the evolution of se-quences along a tree through single nucleotide substitutions at independent positions(Fig.7). Explicit formal sequence evolution models are most prominently analyzedby maximum likelihood approaches [e.g. 54, 91, 93, 157]. Clearly, the developmentof such models requires more or less well justified assumptions concerning the evo-lutionary process.

2.1.1 Commonly used independent-site sequence evolution models

Molecular sequence data provide a recording of characters at a given instant, e.g.at time t0, and contain no direct information on the rate at which they are evolv-ing. During evolution, mutation and natural selection act upon molecules that areintegrated in an organism, and molecules have no knowledge of their previous his-tory. Such alack of memoryis one of the basic assumptions of phylogeny, and inthe theory of stochastic processes this property is attributed toMarkov processes.In other words, the evolutionary future depends only on the current state and noton any previous or ancestral state. Under the standard assumption of astationary,time homogeneous and reversible Markovian substitution processone can, for ev-ery positive∆ t, compute the probability of nucleobaseB j at positionsk under thecondition that nucleobaseBi has been at this position in the initial state (t(0) = t0),which evolves intoB j within the time span∆ t = t − t0. To define such a process

1. Each site of the sequence evolves independently.2. The substitution process has no memory of past events (Markov property).3. The process remains constant through time (homogeneity).4. The process starts at equilibrium (stationarity).5. Substitutions occur in continuous time.

Table 1 Assumptions of the commonly used nucleotide substitutionsmodels, of which some arecollected in Tab. 2.

Page 12: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

12 Tanja Gesell and Peter Schuster

one only has to specify an instantaneous rate matrixQ, in which each entryqi j > 0stands for the rate of change from stateBi to stateB j during an infinitesimal periodof time dt, illustrated in Fig.8. The diagonal elements of matrixQ are defined suchthat the individual rows sum up to zero by

qii = −κ

∑j=1, j 6=i

qi j andκ

∑j=1

qi j = 0 . (4)

This definition is readily interpreted:qi j is the rate determining the change at agiven positionsk from Bi to B j and∑ j , j 6=i qi j is the rate for a change fromBi toany other nucleobase at this position. The diagonal elementqii = −∑ j 6=i qi j , the so-called waiting time, thus determines how fastBi is replaced by another nucleobase.

The rate matrixQ can be formulated in terms of time–dependent probabilities,which are elements of a probability matrixP(t): Pi j (t) is the probability to findB j

at timet at a given positionsk if Bi was the nucleobase at timet0 at this position.From the definition of the matrixQ for an infinitesimaldt and time homogeneity ofthe process follows

P(t) = eQ t ·P(0) = eQ t , (5)

sinceP(0) is the unit matrix for the definition of probabilities used above. Equ.(5)is rather a formal relation that cannot be evaluated analytically, because the expo-

A C G T A C G T A C G T

JC69 K80 HKYA ∗ α α α ∗ β α β ∗ β πC απG β πT

C α ∗ α α β ∗ β α β πA ∗ β πG απT

G α α ∗ α α β ∗ β απA β πC ∗ β πT

T α α α ∗ β α β ∗ β πA απC β πG ∗TN93 F81 GTR

A ∗ β πC α1πG β πT ∗ πC πG πT ∗ aπC bπG cπT

C β πA ∗ β πG α2πT πA ∗ πG πT aπA ∗ dπG eπT

G α1πA β πC ∗ β πT πA πC ∗ πT bπA dπC ∗ f πT

T β πA α2πC β πG ∗ πA πC πG ∗ cπA eπC f πG ∗

Table 2 Instantaneous rate matrices. The table contains six different rate matrices that are basedon the assumptions summarized, for convenience, in Tab. 1. The first model was developed byJukes and Cantor [91], and is specified by a single free parameter since it assumes equal mutationrates between all nucleobases. The other models ordered by increasing number of parameters areK80: Motoo Kimura’s two parameter model distinguishes between transitions and transversions[93],and the more general single substitution models, which consider different base compositions,in particular HKY: the Hasegawa-Kishino-Yano model [78], TN93: the Tamura-Nei model [156],F81: the Felsenstein 81 model [54], and GTR: the general timereversible model[127]. The diagonalelements(∗) are given by Equ. 4.

Page 13: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 13

nential function of a general matrix can only be expressed asan infinite series ofmatrix products or computed numerically in terms of the eigenvalues and eigen-vectors of the matrixQ. In order to turn the probabilities into a dynamical model

we define a normalized vector of nucleobase frequencies,πππ(t) =(

π1(t), . . . ,πκ(t))′

with ∑κi=1 πi(t) = 1,2 which describes the nucleobase distribution at timet. The time

development is therefore expressed in form of a differential equation

dπππdt

= Q ·πππ and πππ(t) = eQ t ·πππ(0) . (6)

The equation on the rhs is identical with Equ.(5) and again solutions are acces-sible only through numerical computation of the eigenvalueproblem forQ (seealso section 2.3.4). Independently of the initial valuesπππ (0) = πππ(0) the nucleobasedistribution converges towards a stationary distributionπππ = (π1, . . . , π|A |)

′. The in-stantaneous rate matrixQ can be multiplied by an arbitrary factorf > 0 that canbe absorbed in timeτ = t · f−1, since time and rate are confounded and only theirproduct can be inferred without extrinsic information [54]. We typically scale timesuch that the expected rate of substitutions per site is

−∑i

πiqii = 1. (7)

Due the multiple substitutions we never observed. We rather observe the number ofdifferencesp, that is computed as

p = 1−∑i

πi pii(t). (8)

The most common models use independence along the sites as well as the otherassumptions stated in Tab.1. It is worth noticing that not every stationary process isreversible. Reversibility, however, is a convenient and reasonable assumption madeby the most commonly used models. A stationary Markov process is said to betimereversible if,

πiqi j = π jq ji (9)

Under the reversibility assumption, the twelve nondiagonal entries of rate matrixQcan be described by six symmetric terms, the so-called exchangeability terms andfour stationary frequencies. Thus, the general time reversible model (GTR) of Tab. 2has 8 free parameters. The further assumption that the rate of substitution is thesame for all nucleotides, can be relaxed by including rate heterogeneity, assumingthat many of the complexities of molecular evolution are primarily manifested as adifference in the relative rate for sites changes, whilst maintaining all other aspectsof the evolutionary process.

2 For convenience we defineπππ as column vector but to save space we write it as a transposed rowvectorπππ ′.

Page 14: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

14 Tanja Gesell and Peter Schuster

Rate Heterogeneity

Each site has a defined probability of evolving at a given rate, independently ofits neighbors. The distribution of relative ratesr is frequently assumed to followa gamma distribution [165, 178]. Further developments, breaking the distributioninto a pre-specified number of categories, make the model more computationallyefficient [179]. Van de Peer et al. [166] used empirical pair-wise methods to infersite-specific rates of alignment positions. Based on this idea, Meyer et al. [106]introduced a maximum likelihood framework for estimating site-specific rates frompairs of sequences with an iterative extension to compute site-specific rates andthe phylogenetic tree simultaneously. In the sense of reflecting different selectiveconstraints at different sites hidden Markov models (HMM) are used to assign ratesof change to each site, according to a Markov process that depends on the rate ofchange at the neighboring site. These approaches model sitedependencies throughshared rate parameters while still assuming independent changes at the differentsites [e.g. 56]. Since we are interested in the evolution of RNA structure, includingdifferent constraints such as energy values, it makes senseto focus on relaxing theassumption about independent evolving sites.

2.1.2 From independent to jointly modeling substitution events

For many years, various authors have attempted to overcome the assumption of inde-pendence across sites. In the simplest cases, a dependence structure with joint sub-stitution events is modeled. A prominent example are codon models, which considerthe dependence of neighboring sites within a codon [71, 111]rather than mononu-cleotide models in protein coding regions. The assumption of independence amongsites is naturally relaxed in case the whole nucleobase triplet of a codon is taken asa unit of evolution. Evidently, this leads to larger rate matrices and makes compu-tations more demanding, but for protein coding regions suchmodels are compara-tively more realistic.

In the same way, Schoniger and von Haeseler [134] have suggested model-ing of joint substitution events in RNA helical regions. Clearly, the nucleotidesin stem regions of RNA molecules cannot evolve independently of their base-pairing counterparts. Given the frequencies of admissiblebase-paired doublets, itis quite likely that substitution at a non-base-pairing doublet (fig.9b) will lead toa base-paired doublet within a relatively short time interval as we would expectin the case of so-called compensatory mutations (see fig.9).The sites are classi-fied into two categories: (i) helical regions and (ii) loops,joints, and free ends –however, with the assumption of a fixed RNA secondary structure. While the unitsof loop regions are mononucleotides, as in the conventionalindependent models(Tab. 2), the units of helical regions are doublets (base-pairs). In the helical regions,the state space is extended to all 16 pair combinations,Bi ,B j ∈ B = A ×A ={AA ,AC,AG, . . . ,GU,UU}. Correspondingly, the F81 model (Tab. 2) is extendedto a 16× 16 instantaneous rate matrix by taking into account the stationary fre-

Page 15: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 15

GCUUC

CGAAG

GC

UC

CG

AG

GCUUC

CGAAG

G A

(a) (b) (c)GCGUC

CGCAG

(d)GCGUC

CGUAG

(e)

Fig. 9 Compensatory mutation in an RNA double helix. A substitution within a double helicalregion destroys a base pair and creates an internal loop: (a)→(b). Further substitution at the internalloop may restore the double helical structure either by reverting the mutation, (b)→(c), or by cre-ating other base pairs at this position (b)→(d) and (b)→(e). The gain in fitness accompanying basepair formation is mirrored by high probabilities for such events, which is tantamount to relativelyshort waiting times (see text).

quenciesπππB = {πAA, πAC, πAG, . . . , πGU, πUU} for the usual restriction that only onesubstitution per unit time is admissible.

Several other attempts to integrate site interactions of RNA base pairing intoa Markov model of sequence evolution have been made [110, 129, 160, 162, 161].Muse [110] introduced a new pairing parameterλ to account for the effects of form-ing or destroying a base pair. Since stem structures are almost always favored by lowGibbs free energy, one might be tempted to conclude that the relative probability ofa change from an unpaired state to a paired state should be greater than the proba-bility for the opposite change, whereas changes from a paired state to an unpairedstate occur with lower frequencies. This explanation, however, contradicts empiricalobservation, because there is no unique trend towards GC rich structures. A moresatisfactory argument for the probabilistic preference ofcompensatory mutationsis that the original stem structure had (substantially) higher fitness than the mutantwith the internal loop, and higher fitness facilitates the appearance of the compen-satory mutant in the population.

The six possible base pairsB = {AU,UA,GC,CG,GU,UG} are explicitly con-sidered in a 6×6 instantaneous rate matrixQ [160], or augmented by one state forall mismatch pairs in a 7×7 matrix [162]. Commonly, the simultaneous substitutionrate of both nucleobases in a base pair are assumed to be zero.However, there arealso models which allow doublet substitutions [133]. SinceMotoo Kimura’s earlyworks on the rates of nucleobase substitutions [93], the observation of compensatorymutations has been thought to explain the conservation of structure. Among furtherstudies on the rate of compensatory mutations we mention [85, 150] here. Rate ma-trices for RNA evolution were also determined empirically from a large sample ofrelated RNA sequences [148]. A more recent investigation [147] has shown thatdifferent secondary structure categories evolve at different rates.

Page 16: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

16 Tanja Gesell and Peter Schuster

2.1.3 Context dependent substitutions

Relaxing the assumption of independently evolving sequence fragments throughthe introduction of overlapping dependencies makes the models substantially moreinvolved. Nevertheless, accounting for stacking interactions between pairs of ad-jacent residues in RNA helices and other energetic constraints in RNA structure(see chapters by Ivo Hofacker and Christian Zwieb LATEX) would definitely improvethe relation between substitutions and RNA structure. Not unexpectedly, empiricalstudies have found indications that the assumption of independent evolution of sitesincluding the dependency on flanking sequence patterns is too restrictive and doesnot match the experimental data [e.g. 24, 108]. Pioneering work of context depen-dent substitutions with respect to the CpG-deamination process has been done byJens Jensen and Anne-Mette Pedersen [87] using a Markov model of nucleotide se-quence evolution in which the instantaneous substitution rates at a site were allowedto depend on the states of a neighboring site at the instant ofthe substitution. Sofar they were able to consider pairs of sequences only. The model consists of a firstcomponent that depends on the type of change while the secondcomponent consid-ers the CpG-deamination process. Later the model has been extended by accountingfor arbitrary reversible codon substitutions with more flexibility concerning CpG-deamination [27]. In these approaches, inference is obtained using Markov chainMonte Carlo (MCMC) of expectation maximization (EM) based pseudo-likelihoodestimation and phylogenetic reconstruction is still a problem [5]. Models reducingthe context were developed that allow for context dependentsubstitutions but can beapproximated by simple extensions of Joe Felsenstein’s original framework. Thesemodels impose certain limitations on the independence of sites but allow for exactinference without too much additional cost in computation efforts (an example is theapproach described in [145]). Nevertheless, some of these models are analyticallysolvable as was recently shown [7] for a special subcase of a model suggested byTamura [156] and extended by the inclusion of CpG doublets. These works take up

Sequence 1

Sequence 2

d

ACU C A GC

ACU C G GC

Fig. 10 Context dependent substitutions. The sketch indicates that the rate for the substitution ofthe central nucleobase, A→G, depends on three further nucleobases on each side (the framed boxcomprises all seven nucleobases). Modeling global contextdependency specifies rates of changefrom every possible sequence to every other possible sequence with the usual restriction of nomore than one position being allowed to change in a particular instant. This dependency structureleads to rate matrices of high dimensionality, which require special approaches to overcome thisproblem (see text for details and section 2.3.4 for an alternative approach).

Page 17: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 17

earlier ideas on the solvability of this class of models suggested in [40] and formallyproven later [3].

Global context dependency models related to considering protein structure werefirst developed in Jeffrey Thorne’s group using a Bayesian MCMC [126]. They de-fined an instantaneous rate matrix that specifies rates of change from every possiblesequence every other possible sequence with the common restriction of no morethan one position being allowed to change at a particular instant. This approach hasbeen modified in order to explore a possibility where the relative rate of sequenceevolution is affected by the Gibbs free energy of RNA secondary structure [182].The results are slightly ambiguous because medium size and large RNA moleculeshave natural orexpectedstructures that differ substantially from the minimum freeenergy (mfe) structures and they may be constrained by structure and function ratherthan by thermodynamic stability (see section 2.3.1). Taking into account the possi-ble dependence of structure on all sites, the rate matrixQ is 4l ×4l for sequenceor chain lengthl , and any full analysis ofQ is not feasible even ifl is extremelysmall (for l = 10 the dimension ofQ is already 410×410). To overcome this highdimensionality, they use a sequence path approach [87, 117].

It is a well known fact in optimization theory that too many additional parame-ters for modeling add noise and imply a risk of overfitting thedata, and this equallytrue for site dependence models. Careful strategies for model building are requiredin order to balance additional parameters with the available empirical information.Such models call for extensive analysis of their mathematical behavior, for exam-ple stability and convergence to equilibrium, and for investigations of their abilityto predict the statistical properties observed with biological sequences. As a conse-quence, computer simulation seems to be the most appropriate tool for identifyingthe necessary parameters for modeling site dependence.

A general simulation framework, which takes into account site-specific interac-tions mimicking sequence evolution with various complex overlapping dependen-cies among sites, has been introduced by Tanja Gesell and Arndt von Haeseler [66].This framework is based on the idea of applying different substitution matrices ateach site (Qk;k = 1, . . . , l ), which are defined by the interactions with other sites inthe sequence. To this end, a neighborhood systemN = (Nk)k=1,2,...,l is introducedsuch thatNk ⊂ {1, . . . , l},k /∈ Nk for eachk and if i ∈ Nk thenk∈ Ni for eachi,k. Nk

contains all sites that interact with sitesk. With nk they denote the cardinality ofNk,i.e. the number of sites that interact withsk. ThusQ = {Qk;k = 1, . . . , l} constitutesa collection of possibly different substitution modelsQk acting on the sequence andan annotation of correlations among sites (Fig.11). The approach allows for mod-eling evolution of nucleotide sequences along a tree for user defined systems ofneighborhoods and instantaneous rate matrices.

Page 18: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

18 Tanja Gesell and Peter Schuster

16 x 16 4 x 464 x 64

x

3 2 2 01

256 x 256 64 x 64Q

n

k

k

q(x) = q(150) q(223) q(l)q(250)q(153)... + ... ... ... + ... ... + ... ... + ...

150 153 223 250 l

Fig. 11 A ribozyme domain with overlapping dependencies. Example of a sequenceS withoverlapping dependencies on site 153. Such dependencies occur, for example, in ribozyme domains[26]. The substitution rate for the whole sequenceq(S) is obtained as the sum of the rates of each

site q(S) = ∑lk=1 Qk(B

(k)j ,B(k)

j ). The nucleobase instantaneous substitution rate depends on thestates of the neighborhood system of this site at the instantof the substitution, described in theinstantaneous rate matrixQk. The dimension ofQk depends on the number of neighborsnk takinginto account at this sitek.

Fig. 12 Genetic and ap-parent distance. The ge-netic distanced(g), withg(t) being a monotonouslyincreasing function of re-altime, counts all substitu-tions whereas the observeddistancep(g) is reduced bydouble and multiple substi-tutions at the same site andhence d ≥ p. The Jukes-Cantor model [91] is used.

2.2 Time in phylogeny and evolution

Evolutionary dynamics is commonly described in realtime orin other words, thetime axis in Figs.5 and 6 is physical time. The morphologicalreconstruction of bi-ological evolution makes extensive use of the fossil recordand hence is anchoredin stratigraphy and other methods of palaeontological timedetermination. Phyloge-netic reconstructions, on the other hand, are counting substitutions and thus not de-pendent on absolute timing. Without loosing generality we assume that substitutionsare related to realtime by some monotonously increasing functiong(t). It is impor-

Page 19: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 19

tant to distinguish a genetic distanced(g) being the number of substitutions andthe observable distancep(g). Because of double and multiple substitutions, whichremain undetectable by sequence comparisons,d ≥ p is always fulfilled (Fig.12).Indeed,p(g) converges to some saturation value limg→∞ p(g) = p∞. At this pointthe two sequences carry no more information on their phylogenetic relatedness. As-suming independent substitution events,p depends on chain lengthl and in the limitof long chainsp andd become equal: liml→∞ p(l) = d.

The functiond(g) relating substitutions to real time clearly depends on the evo-lutionary model, in particular on the rate matrix of substitutions (Table 2). In thesimplest possible case, the Jukes-Cantor model we havep(g) = 3(1− e−4g/3)/4andd(g) = g, at small values ofd we havep≈ d as required for the model, and thesaturation value isp∞ = 3d/4. For more models we refer to the monograph [112,p.265ff.]. As an example of an analysis of very ancient RNA genes that are alreadyclose to the saturation limit at present we mention the attempt to construct a singlephylogenetic tree for all tRNAs [50].

Historically, the first and also the most spectacular attempt to relate phylogenyand time was the hypothesis of a molecular clock of evolution[99, 107, 154], whichis tantamount to the assumption of a linear relationg = λ t: The numbers of mu-tational changes in informational macromolecules, proteins and nucleic acids, areconstant through time and possibly independent of the particular lineage in the senseof a universalλ value. The molecular clock hypothesis seemed to be in excellentagreement with the neutral theory of evolution [94], although the lack of influenceof the generation time has always been kind of a mystery. Moreand more accuratedata, however, have shown that the pace of the molecular clock varies with the par-ticular protein under consideration. In addition, the molecular clock was found totick differently in different lineages and for comparisonsof sequences that divergedrecently or long time ago [4]. Although the usefulness of themolecular clock asan independent source of timing has never been seriously questioned by molecularevolutionary biologists [21], a deeper look at data and models more deeply revealedmany deviations from the naıve clock hypothesis [20, 80, 128].

2.3 Evolution and design of RNA structures

The choice of RNA as model system has several reasons: (i) a large percentage ofthe free energy of folding comes from Watson-Crick base pairing, and base paringfollows a logic that is accessible to combinatorics, (ii) a great deal but definitely notnearly all principles of RNA function can be derived from easy to analyze and pre-dict secondary structures, (iii) evolution of RNA molecules can be studiedin vitroand this fact found direct application in the design of molecules with predefinedproperties, and (iv) viroids, RNA viruses, their life cycles and their evolution can bestudied at molecular resolution and provide insight into a self-regulated evolution-ary process. The knowledge on molecular details of RNA basedentities with partlyautonomous life cycles provides direct access to a world of increasing complexity

Page 20: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

20 Tanja Gesell and Peter Schuster

-26.0

-24.0

-22.0

-20.0

-18.0

-16.0

-14.0

-12.0

-10.0

- 8.0

- 6.0

DG

[k

cal.m

ole

]-1

Y0

Y2

Y3

Y7,8

Y9

2

3

87

9Y10

Y6Y4,5

456

10

kinetic structuressuboptimal structuresminimum free energy structure

Y0

Y1

Y0Y1

GGCCCCUUUGGGGGCCAGACCCCUAAAGGGGUC

Fig. 13 RNA secondary structures viewed by thermodynamics and folding kinetics. An RNAsequenceS of chain lengthl = 33 nucleotides has been designed to form two structures: (i)thesingle hairpin mfe structure,Y0 (red) and (ii) a double hairpin metastable structure,Y1 (blue). TheGibbs free energy of folding (∆G) is plotted on the ordinate axis. The leftmost diagram showsthe minimum free energy structureY0 being a single long hairpin with a free energy of∆G =−26.3 kcal/mole. The plot in the middle contains, in addition, the spectrum of the ten lowestsuboptimal conformations classified and color coded with respect to single hairpin shapes (red)and double hairpin shapes (blue). The most stable – nevertheless metastable – double hairpin hasa folding free energy of∆G = −25.3 kacl/mole. The rightmost diagram shows the barrier tree ofall conformations up to a free energy of∆G = −5.6 kcal/mole where the energetic valleys for thetwo structures merge into one basin containing 84 structures, 48 of them belonging to the singlehairpin subbasin and 36 to the double hairpin subbasin. A large number of suboptimal structureshas free energies between the merging energy of the subbasins and the free reference energy of theopen chain (∆G = 0).

from molecules to viroids, from viroids to viruses and so on –all being now anenormous source of molecular data that wait to be converted into comprehensiveand analyzable models of, for example, host-pathogen interactions and their evo-lution. Nowhere else is the relation between phylogeny and functional evolution –sometimes even within one host – so immediately evident as with RNA viruses.

2.3.1 Notion of structures

Notion, analysis and prediction of RNA structures are discussed in other chapters ofthis edited volume, in particular, chapter LATEX by Ivo Hofacker and chapter LATEXby Christian Zwieb. Here, we want to concentrate on one aspect of RNA structuresthat became clear only within the last two decades: In the great majority of natural,evolutionary selected RNA-molecules we are dealing with sequences forming a sin-gle stable structure, whereas randomly chosen sequences generically form a greatvariety of metastable suboptimal structures in addition tothe minimum free energystructure [138]. Important exceptions of the one sequence-one structure paradigm

Page 21: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 21

are RNA switches fulfilling regulatory functions in nature [73, 103, 144, 173] andsynthetic biology [17]. Such riboswitches are multiconformational RNA molecules,which are involved in posttranscriptional regulation of gene expression. The confor-mational change is commonly induced by ligand binding or ribozymic RNA cleav-age. Multitasking by RNA molecules clearly imposes additional constraints on ge-nomics sequences and manifests itself through a higher degree of conservation inphylogeny.

2.3.2 Sequences with multiple structures

The extension of the notion of structure to multiconformational molecules is sketchedin Fig.13. The RNA molecule forms a well defined minimum free energy (mfe)structure – being a perfect single hairpin (lhs of the figure). In addition to the mfestructure, the sequenceS like almost all RNA sequences3 forms a great variety ofother, less stable structures calledsuboptimal structures(shown in the middle of thefigure). Structures, mfe and suboptimal structures, are related through transitions,directly or via intermediates, which in a simplified versioncan be represented bymeans of a barrier tree [58, 175] shown on the rhs of the figure.Kinetic foldingintroduces a second time scale into the scenario of molecular evolution.4 Based onArrhenius theory of chemical reaction rates,

k = A · e−Ea/RT , (10)

the height of the barrier,Ea determines the reaction rate parameterk and thereby thehalf life of the conformationt1/2 = ln 2/k. In equation (10),A is the pre-exponentialfactor of the reaction,R is the gas constant andT the absolute temperature in◦ Kelvin. The two structures shown in Fig.13 are connected by alowest barrier of20.7 kcal/mole that depending on the pre-exponential factors implies half lives ofdays or even weeks for the two conformations. In a conventional experiment witha time scale of hours the two conformations would appear as two separate entities.Barriers, nevertheless, can be engineered to be much lower and then an equilibriummixture of rapidly interconverting conformations may be observed. Several con-straints are required for the conservation of an RNA switch,and the restrictions ofvariability in sequence space are substantial.

The comparison of the two dominant structuresY0 and Y1 in Fig. 13 pro-vides a straightforward example for the illustration of different notions of stabil-ity: (i) thermodynamic stability, which considers only thefree energies of the mfestructures –Y0 in Fig.13 is more stable thanY1 since it has a lower free energy,∆G(Y0) < ∆G(Y1), (ii) conformational stability, which can be expressed in termsof suboptimal structures or partition functions within a basin or a subbasin of anRNA sequence – a conformationally stable molecule has no lowlying suboptimal

3 Exceptions are only very special sequences, homopolynucleotides, for example.4 Timescale number one is the evolutionary process itself. Inorder to be relevant for evolutionarydynamics the second timescale has to be substantially faster than the first one.

Page 22: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

22 Tanja Gesell and Peter Schuster

Fig. 14 A sketch of the mapping of RNA sequences onto secondary structures. The points ofsequence space (here 183 on a planar hexagonal lattice) are mapped onto points in shape space(here 25 on a square lattice) and, inevitably, the mapping ismany to one. All sequencesS. fold-ing the same mfe structure form a neutral set, which in mathematical terms is the preimage ofYkin sequence space. Connecting nearest neighbors of this set– these are pairs of sequences withHamming distancedH = 1 – yields the neutral network of the structure, Gk. The network in se-quence space consists of a giant component (red) and severalother small components (pink). Onthe network the stability against point mutations varies from λ = 1/6 (white points) toλ = 6/6= 1(black point). We remark that the two-dimensional representations of sequence are used here onlyfor the purpose of illustration. In reality, both spaces arehigh-dimensional – the sequence space of

binary sequencesQ(2)l , for example, is a hypercube of dimensionn and that of natural four-letter

sequencesQ(4)l an object in 3n dimensional space.

conformations that can be interconverted with the mfe structure at the temperatureof the experiment; for kinetic structures separated by highbarriers the partition func-tions are properly restricted to individual subbasins [104], and (iii) mutational sta-bility that is measured in terms of the probability with which a mutation changes thestructure of a molecule (see Fig.14 and section 2.3.3). All three forms of stabilityare relevant for evolution and phylogeny, but mutational stability and the spectrumof mutational effects – adaptive, neutral or deleterious – are most important.

2.3.3 Mapping sequences onto structures

RNA secondary structures provide a simple and mathematically accessible exampleof a realistic mapping of biopolymer sequences onto structures [138, 140]. Heremodeling will be restricted to the assignment of a single structure to each sequence(lhs diagram in Fig.13). Apart from a few exceptions, experimental information onconformational free energy surfaces and fitness landscapesis rather very limited butthe amount of available data is rapidly growing.Realisticmeans here that the neigh-borhood relations are similar to the observations in reality: (i) realistic landscapesarerugged, since nearest neighbor, i.e. Hamming distance one (dH = 1), sequences

Page 23: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 23

may have entirely different or very similar properties, and(ii) realistic landscapesare characterized byneutrality in the sense that different sequences may lead tosimilar or identical structures and may have indistinguishable properties for evo-lution. Both features are observed in mappings of RNA sequences onto secondarystructures.

The mapping in the forward direction, i.e. from sequence space onto a space ofstructures calledshape space, is formalized by

Φ :(

Q(κ)l , dH

)

=⇒ (Yl , dY) or Y = Φ(S) . (11)

As said in the introduction, both sequence and structure space are metric spaceshaving the Hamming distance(dH) and an appropriate structure distance(dY) asmetrics. The superscript ‘κ ’ is the size of the nucleobase alphabet (κ = 4 for naturalsequences). The subscript ‘l ’ indicates that, for simplicity, we restrict the considera-tions here to sequences of the same lengthsl . A setΓ (Yk) containing all sequencesfolding into the same structureYk is denoted asneutral set, it represents the preim-age of the structure in sequence space and it is defined by

Γk = Γ (Yk) = Φ−1(Yk) ={

S j |Yk = Φ(S j)}

. (12)

The neutral network Gk is the graph obtained from the setΓk by connecting all pairsof nodes with Hamming distancedH = 1 by an edge (see Fig.14). The probabilityof a change in structure as a consequence of a mutation of the sequenceS j ∈ Γk ismeasured by the local degree of neutrality

λk(S j) =# neutral mutations ofS j

# all mutations ofS j. (13)

Accordingly,λk(S j) = 0 implies that every mutation ofS j leads to a new structureandλk(S j) = 1 expresses the fact that the structureYk is formed by all single pointmutants ofS j , which might be characterized as amutation resistentsequence. Anaverage of the local degree of neutrality is taken over the entire neutral setΓk toyield λk = ∑S j∈Γk

λk(S j)/|Γk|, the (mean or global) degree of neutrality, which is ameasure for the mutational stability of the structureYk as discussed in section 2.3.2.Typical λk-values for common RNA secondary structures – as found in nature orobtained by evolution experiments with RNA molecules – lie in the range 0.2 <λk < 0.3, implying that roughly 25 % of all mutations are neutral with respect tostructure. The degree of neutrality for complete three-dimensional structures is notknown and additional definitions are required, because spatial structures are pointsin a continuous rather than discrete shape space. Coarse graining of structures isnecessary in order to be able to distinguish alike from different.

Page 24: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

24 Tanja Gesell and Peter Schuster

Fig. 15 The quasispecies as a function of the point mutation rate p. The plot shows the station-ary mutant distribution of sequences of chain lengthl = 50 as a function of the point mutation ratep. The upper part contains the approximation by perturbationtheory according to equation (16)and is compared with the exact results presented in the lowerpart of the figure. Plotted are the rel-ative concentration of entire mutant classes: ¯y0 (black) is the master sequence, ¯y1 (red) is the sumof the concentrations of all one error mutants of the master sequence, ¯y2 (yellow) that of all twoerror mutants, ¯y3 (green) that of all three error mutants, and so on. In the perturbation approach theentire population vanishes at a critical mutation ratepcr called the error threshold (which is indi-cated by a broken gray line atpcr = 0.04501) whereas a sharp transition to the uniform distributionis observed with the exact solutions. Choice of parameters:fm = 10, fi = 1∀ i = 1, . . . ,n; i 6= m.

2.3.4 Chemical kinetics of evolution

In order to introduce mutations into selection dynamics Manfred Eigen [42] con-ceived a kinetic model based on stoichiometric equations, which handle correctreplication and mutation as parallel reactions

(A) + Si

Qji fi−−−−→ S j + Si ; i, j = 1, . . . ,n . (14)

In normalized coordinates (14) corresponds to a differential equation of the form

Page 25: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 25

dxj

dt=

n

∑i=1

Q ji fi xi − x j φ(t) ; j = 1, . . . ,n ;n

∑i=1

xi = 1 . (15)

The finite size constraintφ(t) = ∑ni=1 fixi is precisely the same as in the mutation-

free case (3), and the same techniques can be used to solve thedifferential equation[88, 159]. Solutions of (15) are derived in terms of the eigenvectors of then× nmatrix W= Q ·F = {Wji = Q ji fi}. The mutation frequencies are subsumed in thematrix Q= {Q ji} with Q ji being the probability thatS j is obtained as an errorcopy ofSi . The fitness values are the elements of a diagonal matrix F= {Fi j = fi ·δi, j}. Exactsolution curves and stationary mutant distributions (Π ) are obtained bynumerical computation. The stationary populations have been calledquasispeciessince they represent the genetic reservoirs of asexually reproducing species.

A quasispecies exists if every sequence in sequence space can be reached fromevery other sequence along a finite-length path of single point mutations in the muta-tion network.5 In addition the quasispecies contains all mutants at nonzero concen-trations (xi > 0∀ i = 1, . . . ,n). In other words, after sufficiently long time a kind ofmutation equilibriumis reached at which all mutants are present in the population. Inabsence of neutral variants the quasispecies consists of amaster sequence, the fittestsequenceSm : { fm = max( fi ; i = 1, . . . ,n)}, and its mutants,S j ( j = 1, . . . ,n, i 6= m),which are present at concentrations that are, in essence, determined by their ownfitness f j , the fitness of the masterfm and the off-diagonal element of the muta-tion matrixQ jm that depends on the Hamming distance from the master sequencedH(S j ,Sm)

(

(see Equ.(18))

.The coefficient of the first term in Equ. (15) for any sequenceS j consists of

two parts: (i) the selective valueWj j = Q j j · f j and (ii) the mutational flowω j =

∑ni=1,i 6= j Q ji · fi . For the master sequenceS j the two quantities, the selective value

of the master,Wmm and themutational backflowωm are of particular importance aswe shall see in the discussion ofstrong quasispecies(see below).

The error threshold.

Application of perturbation theory neglecting back mutations at zeroth order yieldsanalytical approximations for the quasispecies [42, 44], which are valid for suffi-ciently small mutation rates [153],Q ji << {Qii ,Q j j }(i 6= j):

xm ≈ Qmm−σ−1m

1−σ−1m

andx j

xm≈ Wjm

Wmm−Wj j, j = 1, . . . ,n; j 6= m

with σm = fm/

f−m and f−m =n

∑k=1,k6=m

f j x j/

(1− xm) .

(16)

5 By definition of fitness values,fi ≥ 0, and mutation frequencies,Q ji ≥ 0, W is a non-negativematrix and the reachability condition boils down to the condition: Wk � 0, i.e. there exists ak suchthat Wk has exclusively positive entries and Perron-Frobenius theorem applies [143].

Page 26: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

26 Tanja Gesell and Peter Schuster

The superiorityσm is a measure of the advantage in fitness the master has over therest of the population, andf−m is the mean fitness of this rest.6

In order to gain basic insight into evolutionary dynamics weintroduce the uni-form error rate model: Thepoint mutation rate p(sk) is expressed in terms of a(mean) mutation rate per site and reproduction event that isassumed to be inde-pendent of the sitesk and the particular sequenceSi . Further simplification is intro-duced by the use of binary rather than four-letter sequences.7 Then, diagonal andoff-diagonal elements of matrix Q are of the simple form

Qii = (1− p)l and

Q ji = (1− p)l−dH(S j ,Si) pdH(S j ,Si) = (1− p)lεdH(S j ,Si)

with ε =p

1− p,

(17)

and insertion into Equ.(16) yields the three parameter (l , p,σ ) expression

xm ≈ (1− p)l −σ−1m

1−σ−1m

and x j ≈ εdH(S j ,Sm) fmfm− f j

xm . (18)

Equ.(18) provides a quantitative estimate for the concentrations of mutants: Forgiven p, x j is the larger the smaller the Hamming distance from the master,dH(S j ,Sm), is and the smaller the larger the difference in fitness,fm− f j is. Thestationary concentration of the master sequence, ¯xm, vanishes at some critical muta-tion ratep = pcr called theerror threshold[10, 42, 45, 46, 153]:8

pcr = 1 − σ−1/l =⇒ pmax ≈ ln σl

and l max ≈ ln σp

. (19)

Fig.15 compares a quasispecies calculated by the perturbation approach with theexact solution and shows excellent agreement up to the critical mutation ratep= pcr.This agreement is important because quantitative applications of quasispecies theoryto virology are essentially based on Equ.(16) (see section 2.3.7 and [34]).

At constant chain lengthl the error threshold defines a maximal error rate forevolution,p≤ pmax, and at constant reproduction accuracyp the length of faithfullycopied polynucleotides is confined tol ≤ lmax [46, 48]. The first limit of a max-imal error ratepmax has been used in pharmacology for the development of newantiviral strategies [35], and the second limit entered hypothetical modeling of early

6 An exact calculation off−m is difficult because it requires knowledge of the stationarycon-centrations of all variants in the population: ¯xi ; x = 1, . . .,n. For computational details see[44, 45, 47, 139].7 It should be noted that artificially synthesized two letter (DU; D = 2,6-diamino-purine) ribozymeshave perfect catalytic properties [124].8 Zero or negative concentrations of sequences clearly contradict the exact results described aboveand are an artifact of the perturbation approach. Nevertheless, the agreement between the exactsolutions and the perturbation results up to the error threshold as shown in Fig.15 is remarkable.

Page 27: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 27

biological evolution where the accuracy limits of enzyme-free replication confinethe lengths of polynucleotides that can be replicated faithfully [49].

Exact solutions and fitness landscapes.

Exact solutions of Equ.(15) computed numerically for asingle-peak fitness land-scape, fm = f0 and fj = fn∀ j = 1, . . . ,n; j 6= m, show a remarkably sudden changein the population structure at the criticalmutation rate,p= pcr, that manifests itselfin three observations (Fig.15):

(i) the concentration of the master sequence ¯xm, becomes very small – zero in theperturbation approach (16),(ii) an abrupt change in the population structure that sharpens with increasing chainlengthl in a phase transition like manner,(iii) a transition to the uniform distribution, ¯xi = κ−l ∀ i = 1, . . . ,κ l , which is theexact solution at the mutation ratep = p = κ−1.

The transition occurs already at small mutation rates far away from p (pcr = 0.045versus ˜p = 0.5 in Fig.15). Undoubtedly, evolution is not possible at point mutationratesp > pcr, since the existence of a uniform distribution implies thatreproductionis operating but inheritance is not: Too many errors are madein the copying processand no individual sequence can be stably maintained over many generations.

Early works [172] have shown that the occurrence of error thresholds depends onthe distribution of fitness values in sequence space. Onsimple landscapes, which aredistinguished fromrealistic, complex landscapes, all genotypes of equal distance tothe master have identical fitness values. Some smooth and simple fitness landscapes,in particular the additive,fm = f0 and f j = f0−α ·dH(S j ,Sm)∀ j 6= m, and the mul-tiplicative landscape,fm = f0 and f j = f0 ·β dH(S j ,Sm)∀ j 6= m, with α > 0 andβ < 1– which are both highly popular in population genetics – exhibit a gradual changefrom the homogeneous population, ¯xm = 1, at p = 0 to the uniform distribution atp = p in contrast to the single-peak landscape that shows the error threshold phe-nomenon (Fig.15).

As said above (section 2.3.3), two features are characteristic for realistic land-scapes: (i) ruggedness and (ii) neutrality [62, 84, 140]. Ruggedness can be mod-eled by assigning fitness differences at random within a bandof fitness values withadjustable widthd. The highest fitness value is assigned to the master sequence,fm = f0 and all other fitness values are obtained by means of the equation

f (S j) = fn +2d( f0− fn)(

η(s)j −0.5

)

, j = 1, . . . ,κ l ; j 6= m , (20)

whereη(s)j is the j-th output random number from a pseudorandom number gener-

ator with a uniform distribution of numbers in the range 0≤ η(s)j ≤ 1 that has been

Page 28: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

28 Tanja Gesell and Peter Schuster

Fig. 16 Quasispecies on model and realistic landscapes.The plots show exact solution curvesfor individual variants in the stationary distribution as afunction of the mutation rate, ¯xi(p). Thechain length isl = 10 corresponding to 1024 binary sequences on a fitness landscape defined byequation (20). The upper part of the figure was calculated with d = 0 corresponding to a single-peak landscape, whereas the lower part represents the results for a band widthd = 0.9375 close tofull randomness of fitness values. Choice of parameters:fm = 2, fi = 1∀ i = 1, . . . ,n; i 6= m, s= 491,d = 0 (upper part) andd = 0.9375 (lower part); color code: master sequence (black), 10 individualsingle point mutants (red), and 45 individual double point mutants (yellow).

started with the seeds.9 Typical results for the stationary mutant distribution areshown in Fig.16: Atd = 0 the landscape becomes the single-peak landscape and allmutants in a given mutant class have the same stationary concentration, ¯x j(p), theerror threshold manifests itself by the sharp decrease in the stationary concentrationof the master, ¯xm(p). For practical purposes it is useful to define a minimum valueof the stationary concentration in order to compute the threshold value of the mu-

9 The seeds indeed defines all details of the landscape that in turn is completely defined bys andthe particular type of the pseudorandom number generator.

Page 29: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 29

tation rate: ¯xm(pcr) = xmin for the calculations reported herexmin = 0.01 was used.The introduction of ruggedness has two effects: (i) The stationary solutions of thesequences belonging to one class split and form a band, whichbecomes broader asthe width of the random scatter,d, increases and (ii) the error threshold becomessharper and the critical mutation ratepcr moves towards smaller values ofp.

Strong quasispecies

Considering the entire range fromp = 0 to the error threshold atp = pcr it mayhappen that the one quasispecies,Π0, is replaced by another one,Π1 at some criticalmutation ratep = ptr [141] or, in other words,Π0 is the stable stationary mutantdistribution in the range 0≤ p ≤ ptr andΠ1 is stable in the rangeptr ≤ p≤ pcr –providedΠ1 is not becoming unstable in this range and then replaced by anotherquasispecies. The transition between the two quasispeciesdescribed in this paper[141, figure 11, p. 657] is extremely sharp and reminds of a phase transition. In thelimit l → ∞ the transition between two quasispecies is indeed a phase transitionin sequence space [45]. Eleven years later the same phenomenon was observed incomputer simulations based ondigital organismsand calledsurvival of the flattest.

The cause of the transition between two quasispecies is readily visualized: Atp = 0 all mutation ratesQ ji are zero and selection in the sense of survival of thefittest, the master sequenceSm with Wmm= fm = max{ fi ; i = 1, . . . ,n} according toEqu. (3a) is observed. Forp > 0 the effective fitness contains in addition to the se-lective value also the contribution from the mutational backflow. For a sequenceSk

with somewhat smaller fitness,fk < fm, or selective value,Wkk < Wmm, a more effi-cient mutational backflow can overcompensate the fitness difference. An straightfor-ward approximation [141] considers only the single point mutation neighborhoodsof the two sequencesSm andSk, which are assumed to be sufficiently distant insequence space. In order to be able to derive an analytical expression all one-errormutants are assumed to have the same fitnessfm+1 and fk+1,10 .respectively, withfk+1 > fm+1 – the less fit sequenceSk is situated in a neighborhood with higherfitness and the fitter sequenceSm has a less fit neighborhood. Next we consider thetwo 2×2 matrices for the two independent blocks:

M = (1− p)l

(

fm fm+1 lεfmε fm+1

(

1+(l +1)ε2)

)

and

K = (1− p)l

(

fk fk+1 lεfk ε fk+1

(

1+(l +1)ε2)

)

.

The largest eigenvalues of the two matrices,λm(p) and λk(p) are readily calcu-lated as functions of the mutation ratep. The conditionλm(ptr) = λk(ptr) defines

10 Alternatively one could think of these fitness values as meanvalues for the entire one errorclasses.

Page 30: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

30 Tanja Gesell and Peter Schuster

Fig. 17 Quasispecies on a realistic landscape supporting a quasispecies with multiple phasetransitions. The plots show exact solution curves for individual variants in the stationary distri-bution as a function of the mutation rate, ¯xi(p). The chain length isl = 10 corresponding to 1024binary sequences on a fitness landscape defined by equation (20). The upper part of the figurewas calculated withd = 0.5 and shows typical quasispecies behavior with an error threshold atpcr = 0.0105, the lower part with fully developed scatterd = 1.0 shows three distinct phase transi-tions marked astr1 (ptr1 = 0.000655),tr2 (ptr2 = 0.001792) andtr3 (ptr3 = 0.002790), and even-tually an error threshold atpcr = 0.0083 (outside the plot). Choice of parameters:fm = f0 = 1.1,fn = 1.0, s= 637; color code: master sequence (black), 10 individual single point mutants (red),and 45 individual double point mutants (yellow),. . . , six error mutants (magenta), seven errormutants (chartreuse), eight error mutants (yellow), nine error mutants (red) and eventually the tenerror mutant (black).

Page 31: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 31

Fig. 18 Quasispecies on a realistic landscape supporting astrongquasispecies.The plots showexact solution curves for individual variants in the stationary distribution as a function of the mu-tation rate, ¯xi(p). The chain length isl = 10 corresponding to 1024 binary sequences on a fitnesslandscape defined by equation (20). The upper part of the figure was calculated withd = 0.5 andshows simple quasispecies behavior with an error thresholdat pcr = 0.0110, the lower part withfully developed scatter atd = 1.0 shows an error threshold atpcr = 0.0078 (outside the plot).Choice of parameters:fm = f0 = 1.1, fn = 1.0, s= 919; color code: master sequence (black), 10individual single point mutants (red), and 45 individual double point mutants (yellow).

Page 32: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

32 Tanja Gesell and Peter Schuster

the position of the transition. The equation forptr is rather complicated but it isstraightforward to proceed in several steps:

α = l − fm− fkfk+1− fm+1

, β = l − fm fk( fk+1− fm+1

fm+1 fk+1( fm− fk)and

γ =( fm fm+1− fk fk+1)

2

(l −1) fm+1 fk+1( fm− fk)( fk+1− fm+1),

ζtr =12

(

α + β − γ +√

(α + β − γ)2−4αβ)

,

εtr =

1− ζtr

l −1and ptr =

εtr

1+ εtr.

(21)

This approximation – like the perturbational approach to the position of the errorthreshold – yields astonishingly good results. Forl = 50, fm = f0 = f(0) = 10,11

fm+1 = f(1) = 1, fk = f(50) = 9.9, and fk+1 = f(49) = 2 perturbation theory yieldsptr = 0.0366 andpcr = 0.0450 whereas the values obtained by numerical solutionof full eigenvalue problem areptr = 0.0362 andpcr = 0.0454, respectively [141].

The nature of a quasispecies and the appearance of phase transitions clearlydepend on the particular fitness landscape. We illustrate bymeans of two ex-treme cases, examples of which are characterized by the random seedss = 637and s = 919: For low and moderate scatter of fitness values (d = 0.5) a land-scape specific spreading of the curves of individual sequences within one errorclass is observed and otherwise the pictures are very similar as a comparison ofthe two upper plots in Figs.17 and 18 shows. Increasing the scatter to the maxi-mal value ofd = 1 shows dramatic differences between the two landscapes. Inthefirst case (s= 637) a phase transition appears atd = 0.950, two consecutive phasetransitions are observed atd = 0.995 and eventually at full scatter (d = 1) threephase transitions are found and the corresponding master sequences for individualranges are:0 for 0≤ p < 0.000655,1003 for 0.000655≤ p < 0.001792,923 for0.001792≤ p < 0.002790, and247 for 0.002790≤ p < 0.00832= pcr.

The second case (s= 919) exhibits no phase transitions at all in the entire range0≤ p< 0.00839= pcr and has be characterized as astrong quasispecies. The specialstability of the master sequence, which is not replaced by any other sequence of highfitness can be explained by means of the distribution of fitness values: The mastersequence0 ( f0 = 1.1), the fittest one error mutant4 ( f4 = 1.0966) and the fittesttwo error mutant516 ( f516 = 1.0970) form a cluster that is coupled by intensivemutational flow and backflow (Fig. 19), and this highly efficientclanof sequences isso strong that it cannot be outperformed by another group of sequences surroundinga master sequence. A diagnostic tool for predicting strong quasispecies is shownin Fig. 19: The fittest mutant of the two error class in the (Hamming distance one)neighborhood of the fittest one error mutant has to have its fittest neighbor not in the

11 The subscript in parentheses stands for the error class, e.g., for l = 50 f(1) is the fitness of all 50one error sequences orf(50) is the fitness of the single 50 error sequence.

Page 33: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 33

Fig. 19 Mutation flow in strongquasispecies.The sketch focuses on the sequence space aroundthe master sequence (0; chain lengthl = 10) and the sequences with the highest fitness valuesare indicated by boldface numbers (The numbers are decimal values of binary sequences, e.g.516≡1000000100). Normal mutation flow is shown in gray thin lines, strong mutation flow inboth directions,0↔4↔516, is indicated by black thick lines and the related but weak ( and there-fore unimportant) flow,0↔512↔516, by black dotted lines. Detailed values:s = 919; fitnessvalues: f0 = 1.1000, f4 = 1.0966, f512 = 0.9296, f516 = 1.0970; color code: master sequence(black), one error class with 10 single point mutants (red),two error class with 45 double pointmutants (yellow), and three error class with 120 triple mutants (green).

three error class – as would be favored by probability – but inthe one error class.Since the neighborhood of any sequence from the two error class consists of twomutants from the one error class and eight mutants from the three error class thechance to encounter such a situation is 2/10 or 20 %.

Is there an evolutionary advantage for strong quasispecies? Apparently, underconditions of shifting mutation rates phase transitions can destabilize populationswhen they drift into regimes where other quasispecies are stable. Whether we aredealing with a strong quasispecies or not depends on the fitness landscape and arough estimate how likely a randomly chosen region in sequence space would besuitable for a strong quasispecies can be found yields about20 %. This is not asmall fraction and populations would not need to drift very far to reach an area insequence space that supports stable mutant distributions.A prediction to be checkedexperimentally would be therefore that evolution had foundsuch regions and naturalpopulations should represent strong quasispecies.

Page 34: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

34 Tanja Gesell and Peter Schuster

Neutrality.

Neutrality can be introduced into random landscapes in straightforward way bymeans of a predefined degree of neutrality,λ . Then the fitness landscape is of theform

f (S j) =

f0 if η(s)j ≥ 1−λ ,

fn + 2d1−λ ( f0− fn)

(

η(s)j −0.5

)

if η(s)j < 1−λ ,

j = 1, . . . ,κ l ; j 6= m .

(22)

with the two limiting cases: (i) limλ → 0 yielding the non-neutral random land-scape (20) and (ii) limλ → 1 leading to the fully neutral case as modeled andanalyzed by Motoo Kimura [94]. Evolution on neutral landscapes is described byneutral networks (section 2.3.3) formed from sequences of identical fitness. Depend-ing on the Hamming distance between neutral master sequences they form either agroup of sequences coupled by selection dynamics or random selection takes placeand only one sequence survives in the sense of Kimura’s theory. The case of vanish-ing mutation rates, limp → 0, has been analyzed for two neutral sequences [141],S j andSk and different Hamming distancedH(S j ,Sk):

1. dH = 1: limp→0xjxk

= 1 or limp→0 x j = limp→0 = xk = 0.5,

2. dH = 2: limp→0xjxk

= α or limp→0 x j = α/(1+ α), limp→0 xk = 1/(1+ α),

with some value 0≤ α ≤ 1, and3. dH ≥ 3: limp→0 x1 = 1, lim p→0 x2 = 0 or limp→0 x1 = 0, lim p→0 x2 = 1.

In full agreement with the exact result we find that two fittestsequences of HammingdistancedH = 1 are selected as a strongly coupled pair with equal frequency of bothmembers.

Numerical results show that strong coupling does not occur only for small mu-tation rates but extends over the whole range ofp-values fromp = 0 to the errorthresholdp = pcr (Fig.20). Examples for case 2 are also found on random neutrallandscapes and again the exact result for vanishing mutation rate holds up to the er-ror threshold. The existence of neutral nearest and next nearest neighbors manifestitself by the lack of a unique consensus sequence of the population has an importantconsequence for phylogeny reconstruction (see Fig.21). Asshown in the sketch inFig.14 neutral networks may comprise several sequences andthen, all neutral near-est neighbor sequences form a strongly coupled cluster in reproduction where theindividual concentrations are determined by the largest eigenvector of the adjacencymatrix of the network.

Page 35: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 35

Fig. 20 Quasispecies on a realistic landscape with neutral sequences.Two neighboring se-quences of highest fitness form amaster pairof Hamming distancedH = 1, which is surroundedby 18 single point mutations. The plot shows the dependence of the joint quasispecies on the pointmutation ratep. The concentrations of the two master sequences are practically identical for allmutations rates. Choice of parameters:fm = 1.1, fn = 1.0, d = 0.5, λ = 0.1; color code: mastersequences (black and red), 18 individual single point mutants of both master sequences (orange).

Fig. 21 Quasispecies and consensus sequences in case of neutrality . The upper part of the figureshows a sketch of sequences in the quasispecies of two fittestnearest neighbor sequences (dH = 1).The consensus sequence is not unique in a single position where both nucleotides appear withequal frequency. In the lower part the two master sequences have Hamming distancedH = 2 anddiffer in two positions. The two sequences are present at some ratioα that is determined by thefitness values of other neighboring sequences, and the nucleobases corresponding to the two mastersequences appear with the same ratioα .

Page 36: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

36 Tanja Gesell and Peter Schuster

2.3.5 Advantages and deficits of the quasispecies concept

The conventional synthetic theory of evolution considers reproduction of organismsrather than molecules [105]. In contrast, the kinetic theory of evolution [42, 46, 47]is dealing with the evolutionary processes at the molecularlevel. Correct repro-duction and mutation are implemented as parallel reactionsand various detailedmechanisms of reproduction can be readily incorporated into the kinetic differentialequations. An extensively investigated example is RNA replication by means of anenzyme from the bacteriophage Qβ [11, 12, 13, 171]. The detailed kinetic analy-sis reveals the condition for the applicability of the mutation selection model (15):The RNA concentration has to be smaller than the concentration of the replicationenzyme. Provided the mechanism of reproduction is known it is straightforward toimplement replication kinetics in a system of differentialequations [120, pp.29-75].This is also true for epigenetic effects, which may require the use of delay differen-tial equations. Furthermore, the kinetic theory rather than the conventional geneticsapproach is the appropriate basis for molecular evolution and, in particular, molec-ular phylogeny since these concepts deal with genes and genomes within organismsand not with the organisms themselves.

Explicit consideration of mutations as parallel reactionsto error-free reproduc-tion is manifested in the structure of stationary populations, noad hocassumptionsare required for the appearance of mutants. In addition, high mutation rates as found,for example, in test tube evolution experiments [9, 14, 149]and in virus reproduc-tion [34] do not represent a problem and are handled equally well as low mutationrates. The kinetic model relates evolutionary dynamics directly to fitness landscapes,which have a straightforward physical interpretation and can be measured. Unclearand ambiguous results as obtained with simple models of fitness landscapes demon-strate that an understanding of evolution is impossible without sufficient knowledgeon the molecular basis of fitness. In simple systems fitness isa property that canbe determined by the methods of physics and chemistry, and thus independentlyof evolutionary dynamics. The current explosion of harvested data provides a newsource of molecular information that can be used for the computation of fitness val-ues under sufficiently simple conditions.

As a starting point the phylogenetic approach focusses on the evolution of indi-

vidual sitess(i)k in a sequenceSi =

(

s(i)1 s(i)

2 · · ·s(i)l

)

whereas the kinetic model con-siders full sequences initially as expressed by the mutation frequenciesQ ji and fit-ness valuesfi . The independent site model of theoretical phylogeny is augmentedby dependencies on other sites with increasing complexity eventually ending up ataccounting all sites in the sequence. The kinetic approach progresses in oppositedirection from sequences to smaller entities but eventually encounters the same pa-rameter problem as the phylogenetic method. The uniform error model, for example,assumes independent mutation of and equal frequencies of mutation at all sites. Amajor difference between the two approaches concerns handling of mutations. Inphylogeny the matrixQ is a rate matrix whereas the kinetic approach separates mu-tation expressed by the matrix Q and structural or functional effects represented bythe fitness values in matrix F. This has the advantage that allfitness related energetic

Page 37: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 37

effects, for example base pairing in double helical structure fragments are enteringF rather than Q. Of course, there are sequence dependent effects on pure mutationlike the occurrence ofhot spotswith higher mutation frequencies than at the othersites of the sequence, which commonly depend on RNA structure. Again taking intoaccount site and context dependent point mutation rates increases the number of pa-rameters enormously and considering to much detail encounters the same problemas in phylogeny. In the era of genomics the whole sequence approach appears moreappropriate and the wealth of currently available data are more easily introducedinto the kinetic approach.

The quasispecies is the stationary solution of a deterministic, differential equa-tion based model that, in principle, is bound to constant population size. Analysisof the basic ODEs, however, has shown that the results in relative concentrations(xi ; ∑n

i=1xi = 1) are generally valid as long as the population does neitherdie out(∑n

i=1Ni = 0) nor explode (∑ni=1Ni = ∞) [47]. Then, the particle number are not

normalizable and the structure of the population cannot be predicted from solutionsof Equ.(15). Nevertheless, quasispecies may still exist for vanishing populations butany rigorous treatment has to start out from the original kinetic equations with popu-lation size being treated as a variable. Variants with zero fitness are compatible withthe error threshold phenomenon [155, 158] but the prerequisites for the conventionalcalculation of the quasispecies are no longer fulfilled – notevery sequence can bereached from every sequence by a finite chain of point mutations (section 3.2.2).

Another question is hard to answer at present: Do the populations in nature everreach a stationary state?In vitro evolution experiments can be carried out in sucha way that stationarity or mutation equilibrium is achieved, but is this true alsoin nature. In virus infections specific mutants appear also within the infected hostbut on the other hand the effect of infection and the course ofdisease is rathersimilar with comparable hosts indicating that viruses are in a comparable state atthe beginning of an infection.

Two other problems are quite general in biological modelingin particular onthe molecular level: (i) Most models assume spatial homogeneity whereas cellsare highly structures objects with limited diffusion, active transport, and spatial lo-calization of molecular players, and (ii) many results are derived from differentialequations, which are based on the use of continuous variables, and thereby it isimplicitly assumed that populations are very large. In principle, the definition ofcontinuous space and time requires infinite population sizewhat is a reasonable andwell justified assumption in chemistry but not in biology where sample sizes maybe very small. Examples are the often extremely small concentrations of regulatorymolecules. Systematic studies on the very small bacteriumMycoplasma pneumo-niae in the spirit of systems biology [76, 98, 183] have shown thatin extreme casesonly one molecule per hundreds or even thousands of bacterial cells is present ata given instant in the population. Also the assumption of a homogeneous space isquestionable: There is very little free diffusion in real cells and even bacterial cellshave a very rich spatial structure. Stochasticity plays an important role and discretestochastic rather than deterministic continuous variables should be applied.

Page 38: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

38 Tanja Gesell and Peter Schuster

Finally we mention a general problem for evolutionary models. Using conven-tional modeling with ODEs all populations would extend overwhole sequencespace. A drastic example is the uniform distribution of sequences is predicted atmutation rates above error threshold. Coverage of sequencespace can never oc-cur in a finite world: Even for small RNA molecules of tRNA sizewe would needa population size ofN = 1046 individuals in order to have one molecule for everypossible sequence whereas the largest populations inin vitro experiments with RNAhardly exceedN = 1015 molecules. What we have instead are clones of sequencesmigrating through sequence space (see, for example, [62, 63, 84, 119]. Truncationof fitness landscapes has been suggested recently as a possible solution to the prob-lem [130]. Alternatively, one could leave the full landscape and truncate populationsthrough setting all concentrations less than one molecule per reaction volume equalto zero and eliminate the corresponding variables. New variables come into playwhen their concentration exceeds this truncation threshold similarly as occurring instochastic processes.

2.3.6 Evolutionary optimization of structure

In order to simulate the interplay between mutation acting on the RNA sequenceand selection operating on the phenotypes, here the RNA structures, the sequence-structure map has to be an integral part of the model [60, 61, 62, 63]. The sim-ulation tool starts from a population of RNA molecules and simulates chemicalreactions corresponding to replication and mutation in a continuous stirred flow re-actor (CSTR, see section 4.2) and uses an algorithm developed by Daniel Gillespie[68, 70]. In target search problems the replication rate of asequenceSk, represent-ing its fitnessfk, is chosen to be a function of the structure12 distance between themfe structure formed by the sequence,Yk = Φ(Sk) and the target structureYT ,

fk(Yk,YT) =1

α + dS(Yk,YT)/l, (23)

which increases whenYk approaches the target (α is an adjustable parameter thatwas commonly chosen to be 0.1). A trajectory is completed when the populationreaches a sequence that folds into the target structure. Thesimulated stochastic pro-cess has two absorbing barriers, the target and the state of extinction. For sufficientlylarge populations (N > 30 molecules) the probability of extinction is very small, forpopulation sizes reported here,N ≥ 1000, extinction has been never observed.

A typical trajectory is shown in Fig.22. In this simulation ahomogenous popu-lation consisting onN = 1000 molecules with the same random sequence and cor-responding mfe structure is chosen as initial condition (Fig.23). The target structureis the well-known clover leaf of phenyl-alanyl-transfer RNA (tRNAphe). The meandistance to target of the population decreases in steps until the target is reached

12 Several measures for the distance between structures can beapplied. Here we have chosen theHamming distance between the parentheses notation of structures,dS.

Page 39: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 39

50

40

30

20

10

10

8

6

4

2

0

0

Ha

mm

ing

dis

tan

ce

to

ta

rge

tH

am

min

g d

ista

nce

30

20

10

0

Ha

mm

ing

dis

tan

ce

replications

2.00.0 4.0 6.0 8.0 10.0 12.0 14.0 10´6

Fig. 22 A trajectory of evolutionary structure optimizatio n. The topmost plot presents the meandistance to the target structure of a population of 1000 molecules (The initial random structure andthe target structure are shown in Fig.23. The plot in the middle shows the width of the populationin Hamming distancedH and the plot at the bottom is a measure of the velocity with which thecenter of the population migrates through sequence space. Diffusion on neutral networks causesspreading on the population in the sense of neutral evolution [84]). A synchronization is observedat the end of each quasi-stationary plateau where a new adaptive phase in the approach towardsthe target is initiated. The synchronization is caused by a drastic reduction in the population widthand a jump in the population center (The top of the peak at the end of the second long plateau ismarked by a black arrow). A mutation rate ofp= 0.001 was chosen, the replication rate parameteris defined in Equ.(23).

[60, 63, 137]. Short adaptive phases are interrupted by longquasi-stationary epochs,the latter falling into two different scenarios:(i) The structure is constant and we observe neutral evolution in the sense of Kimura[94]. The numbers of neutral mutations are proportional to the numbers of replica-tions and the evolution of the population can be understood as a diffusion processon the corresponding neutral network [84].(ii) The process during the stationary epoch involves several structures with iden-tical replication rates and the evolutionary process is a kind of random walk in thespace of these neutral structures.

Page 40: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

40 Tanja Gesell and Peter Schuster

Fig. 23 Initial and target structure of the computer optimization experiment in Fig.22. Struc-tureY0 is the mfe structure of a random sequence with chain lengthl = 76. The secondary structureof phenyl-alanyl-transfer RNA (tRNAphe) was chosen to be the target structureY44. Stacking re-gions are shown in color. The two sequences,S0 and S44, are shown below; they differ in 46positions. Structures were computed with theVienna RNA Package, Version 1.8.5.

The diffusion of the population on the neutral network is illustrated by the plot inthe middle of Fig.22 showing the width of the population as a function of time[137, 138]. The population width increases during the quasi-stationary epoch andsharpens almost instantaneously after a sequence that allows for the start of a newadaptive phase in the optimization process had been produced by mutation. The sce-nario at the end of the plateau corresponds to abottle neckof evolution. The lowerpart of the figure shows a plot of the migration rate or drift ofthe population cen-ter and confirms the interpretation: The drift is almost always very slow unless thepopulation center ‘jumps’ from one point in sequence space to another point fromwhich the new adaptive phase is initiated. A closer look at the figure reveals the co-incidence of the three events: (i) beginning of a new adaptive phase, (ii) collapse-likenarrowing of the population, and (iii) jump-like migrationof the population center.

2.3.7 From sequences and structures to genotypes and phenotypes

Genotypes or genomes are RNA or DNA sequences. The phenotypecomprisesstructures and functions, both of which are dependent on thespecific experimen-tal or environmental setup. The simplest system capable of selection and mutationconsists of RNA molecules in a medium that sustains replication. Then, genotypeand phenotype are simply the RNA sequence and structure, respectively. In case of

Page 41: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 41

the Qβ evolution experiments the functional requirements are a result of the replica-tion mechanism [11, 12, 13]. In order to be replicated the Qβ -virus RNA moleculesmust carry an accessible recognition site for binding to theenzyme Qβ -replicase[8, 171] and the replication involves a complementary or plus-minus strand copying

mechanism withf (+)j and f (−)

j being the replication rate parameters for plus-strandand minus strand synthesis. After internal equilibration the plus-minus ensemblegrows exponentially with an overall fitness constant, whichis the geometric mean

of the fitness values of both strands:f j =√

f (+)j f (−)

j . Thus, the phenotype in thecase of Qβ -replication in the test tube is the ensemble consisting of both strands.

Outside plant cells viroids arenaked, cyclic, and especially stable RNA moleculeswhose sequences are the viroid genotypes. Viroid RNAs are multiplied throughtranscription by the host cell machinery and carry specific recognition sites atwhich transcription is initiated. Replication ofPotato spindle tuber viroid, for ex-ample, starts predominantly at two specific sites with the closely related sequencesGGAGCGA at positionA111 andGGGGCGA at positionA325 of the viroid RNAwith a chain length ofl = 359 nucleotides [53] – the two positions are almost onopposite sides of the cyclic RNA, 214 or 145 nucleotides apart. Viroid RNAs havemany loops and bulges that serve two purposes: (i) They allowfor melting of the vi-roid RNA since a fully double stranded molecule would be too stable to be opened,and (ii) they carry the recognition sites and motifs for replication and system traf-ficking [32, 184], which have also been studied on the 3D structural level [185].Although viroid reproduction in nature requires a highly specific host cell and therewas a common agreement that viroids in general are highly species specific, recentattempts to replicateAvocado sunblotch viroidin yeast cells have been successful[31]. The viroid phenotype is already quite involved: It hasa structural componentthat guarantees high RNA stability outside the host cell, but at the same time thestructure is sufficiently flexible in order to be opened and processed inside the cell.Like the Qβ RNA, viroid RNA carries specific recognition sites for initiation andcontrol of the reproduction cycle.

Viruses, in essence, are like viroids but the complexity of the life cycle is in-creased by three important factors: (i) virus DNAs or RNAs carry genes that aretranslated in the host cell, yield virus specific factors, and accordingly viruses havegenetic control on the evolution of these coding regions,(ii) the virus capsid maycontain functional protein molecules, for example replicases, in addition to the virusspecific genetic material, and (iii) viruses are coated by virus specific proteins orproteins and membranes (For a recent treatise of virus evolution see [34]).

Bacterial phenotypes are currently to complex for a comprehensive analysis atthe molecular level. An exception are the particulary smalland cell-wall free bac-teria of the genusMycoplasma. In particular,Mycoplasma genitaliumis a parasiticbacterium and was considered to be the smallest organism forquite some time.13 Itsgenome consists of one circular chromosome with 582 970 basepairs and 521 genesof which 482 encode for proteins. Extensive studies aiming at full systems biology

13 The smallness record is currently hold byNanoarchaeum equitanswith a genome size of 490 885base pairs.

Page 42: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

42 Tanja Gesell and Peter Schuster

of a cellular organism were performed on the some what largerspeciesMycoplasmapneumoniaewith a genome size ofl ≈ 816000 base pairs [76, 98, 183]. For largerorganisms – ordinary bacteria of about tenfold size and small eukaryotic cells –flux balance analysis [114] rather than full molecular systems biology has been per-formed (An early example of flux balance analysis ofEscherichia coliis found in[41]). Another approach to complex phenotypes is based on completely annotatedgenomes and information on gene interactions mainly comingfrom proteomics. Agene interaction network has been worked out recently for a small eukaryotic cellwith roughly 6000 genes [29] and it is overwhelmingly complex.

3 From theory to applications

Here, we shall illustrate how the theory of evolution is applicable to problems ofpractical interest. We present a few selected examples of applications from phy-logeny and evolutionary dynamics, although the field has become so rich that onecould easily fill books with examples.

3.1 Applications of phylogenetic sequence evolution models

An in-depth knowledge of how sequences evolve may evidentlyhelp us to improvethe reconstruction of phylogenetic trees based on sequences. This classical appli-cation is described in section 3.1.1. Regardless of this ongoing debate, sequenceevolution models in recent years have also addressed the fields of RNA researchfrom a broader viewpoint. Here, we only mention two further applications: the valueof phylogenetic simulation for distinguishing ancestral and functional correlations,in section 3.1.2 and the application of phylogenetic complex models to genomicncRNA screens in section 3.1.3.

3.1.1 Phylogenetic tree inference

There are currently four main methods of phylogenetic inference: methods basedon the parsimonious principle, i.e. maximum parsimony [57], statistical methodssuch as maximum likelihood [54] or Bayesian inference [122], and distance-basedmethods like neighbor-joining [131]. For further examplesof tree reconstructionmethods and detailed descriptions we refer to [55].

Even though, work has been done on the development of RNA base-pair substi-tution models with non-overlapping tuples, it has not yet been widely adopted bythe scientific community. One reason could be that a priori knowledge about themolecular structure is necessary. The other reason could bethat the improvementsin phylogenetic inference afforded by these models are not significant enough in

Page 43: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 43

comparison to independent models with rate heterogeneity.The bias introduced insequence analysis by ignoring heterogeneous rates among sites has been studiedin population genetics [cf. 2] and phylogenetic reconstruction [180], where it hasbeen shown that the inclusion ofΓ -distribution usually improves the estimation ofother evolutionary parameters, including tree topology. Analyzing the efficiency ofthree reconstruction methods when sequence sites are not independent, Schonigerand von Haeseler [135] demonstrated that the inferred tree is not greatly affectedby the presence of these kinds of correlations. In a comparative study Savill et al.[133] have shown that models permitting a non-zero rate of double substitutionsperformed better than those restricting - as is usually done- the number of allowedsubstitutions to one per unit time. A software package, calledPHASE [82], for phy-logenetic inference of RNA sequences with a range of base-pairs substitution mod-els is available. A simulation study has been recently published that recommends toinclude RNA secondary structure during phylogenetic inference [92].

Direct combination, such as in the case of thermodynamic nearest neighbor mod-els that follow a phylogenetic approach, is considerably harder to implement [182].Standard methods can no longer be used for likelihood computation and parameterestimation if Markov random fields arise. Therefore, a lot ofwork is currently being

S1: UAUUCGGCACGUACGAGGAAUGCUGGUAAG

S2: UAUUCGGCACGUACGAGGAAUGCUGGUAAG

S3: UUUACGGCACGUACGAGGAAUCCUGGUAUG

ESTIMATIONSIMULATION

5

2

S3S2 S5S1

e1 e

2

e3

e4

e5

e8

e7

e6

S4

5

2

S3S2 S5S1

e1 e

2

e3

e4

e5

e8

e7

e6

S4

S4: UUUACCGCACGUACGAGGAAUCCUCGUUUG

S5: UUUACCGCACGUACGAGGAAUCCUCGUUUG

S1: UAUUCGGCACGUACGAGGAAUGCUGGUAAG

S2: UAUUCGGCACGUACGAGGAAUGCUGGUAAG

S3: UUUACGGCACGUACGAGGAAUCCUGGUAUG

S4: UUUACCGCACGUACGAGGAAUCCUCGUUUG

S5: UUUACCGCACGUACGAGGAAUCCUCGUUUG

Fig. 24 Phylogenetic simulations and estimations under constraints. Whenever we analyze aset of homologous sequences, we have to take their evolutionary history of the observed sequencesinto account. Simulating sequence evolution, while takingsite-specific interactions into account,is helpful for investigating the performance of both tree-building methods and structure predictionmethods. Furthermore, they are of value in themselves because they contribute to our understandingof the interrelation between structure and the substitution process.

Page 44: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

44 Tanja Gesell and Peter Schuster

done with the purpose of covering the required technical skills, even though thesemethods are still not practical on a wider scale.

3.1.2 Phylogenetic simulations: Ancestral and functionalcorrelations

The models used for simulation need to be more accurate and complex descrip-tions of nature than those used for inference. Taking molecular structure into con-sideration while simulating sequence evolution is helpfulfor investigating the per-formance of both tree-building methods and structure prediction methods. Further-more, these simulations are of value in themselves because they contribute to ourunderstanding of the intertwined relationship between structure and the substitu-tion process. For example, the use of supervised sequence evolution allows us tocontrol and study the extent of structural and sequence conservation as RNA struc-ture stability or the influence of phylogentic diversity. The necessity of a structuredefinition from a phylogentic viewpoint can be illustrated by the ambiguous def-inition of RNA families LATEX . Although methods using phylogeny and structurealready exist, an explicit definition from a phylogenetic viewpoint was missing. Aphylogenetic structure has since been introduced [65]. This definition is based on asimulation framework that includes a model which constitutes a range of differentsubstitution models acting on a sequence and an annotation of correlations amongsites, a so-called neighborhood system as described in section 2.1.3 . A phylogeneticstructure (PS) is then defined by a neighborhood system, a substitution model anda phylogenetic tree (Fig. 25). The substitution model specifies the evolutionary pro-cess of nucleotide evolution. However, the model is influenced by the neighborhoodsystem that defines the interactions among sites in a sequence. The phylogenetictree introduces an additional dependency pattern in the observed sequences. A PS

⇒ A

U

C

G

A

U

C

G

A

U

C

G

A

U

C

G ∪R

A

B

C

D E

a

b

d e

c

N eighbourhood System Substitution Model PhylogeneticTree

Fig. 25 An example of a phylogenetic structure. Left:an example of a thermodynamically im-probable neighborhood system was chosen for didactic reasons. Middle: modelQ constitutes acollection of possibly different substitution modelsRight: example of a phylogenetic tree withthree extant taxa.

Page 45: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 45

appears ina set of sequences at different instants. These can be transformed, forexample, into the minimum free energy secondary structure.A realization of a PSis then a relational object at instantt. Fig. 26 is an illustration of what has just beendescribed, taking the example of a phylogenetic structure of Fig. 25 in relation toa minimum free energy (mfe) structure. For didactic reasons, a thermodynamicallyimprobable neighborhood system with a helix of such length and a long part of inde-pendent sites is used. However, this thermodynamically improbable neighborhoodsystem influences the collection of different substitutionmodels acting on the se-quence. As part of this concept, it is possible to mimic sequence evolution under thestructural constraint of the PS, e.g given the condition of compensatory mutationswith an extended Felsenstein model. Due to the stochastic nature of the substitu-tion process, however, sequences will probably observed inthe course of evolutionthat at least temporarily - exhibit predicted structures, deviating to some extent fromthe neighborhood system. Fig. 26 shows the predicted mfe-structure of a generatedsequence as one possible realization of the phylogenetic structure (Fig. 25) at timed = 0.42 on brancha. In comparison to the neighborhood system of the PS, the longhelix of the neighborhood system maps well onto the helix of realizations. In con-trast the upper independent part is folded due to the thermodynamic impossibility ofsuch a long loop. In a nutshell: although the neighborhood system is thermodynam-ically improbable, it is transformed in possible thermodynamic realizations givenan evolutionary history. Fig. 27 illustrate the diversity of realizations of a PS as de-fined in Fig. 25: the stem region of the realization should similarly be defined bythe constraint through the neighborhood system and the substitution model, while

A

T

C

G

R

A

B

C

D E

a

b

d e

c

mfe-realisation

time d=0.42

PS

Fig. 26 PS transformation: A Phylogenetic Structure (PS) of Fig. 25 is summarized on theleftand is transformed at timed = 0.42 in a mfe realization and suboptimal mfe realizations on theright. By comparing the realization with the neighborhood system most base pairs are the same.However, due to the thermodynamical improbability of such along loop, the upper independentpart is folded.

Page 46: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

46 Tanja Gesell and Peter Schuster

the upper loop region has no site-specific interactions and should lead to differentrealizations through the mfe folding. In the upper part, however, we can observesimilar mfe folds between D and E at timed = 0.31, based on the previous ancestralstate at C. With time, the differences between D and E become progressively moreapparent. In the uppermost frame, the species B and D have more structural simi-larities than either has with E, although D and E are more closely related. Thus, thephylogenetic history suggests that D and E are a family. However, according to theobserved secondary structure realization B and D form a class at timed = 0.48.

Following a PS, we have to distinguish between different constraints for inter-actions observed among the sites. A structural constraint defines the evolutionarystrength of structuring sequences at different instants t and can be differentiatedas: ancestral constraint and neighborhood constraint. Neighborhood constraints aresite-specific interactions acting on the sequence along theevolutionary process. Inthis sense, the interactions observable between the sites through this neighborhoodconstraint are called neighborhood or functional correlations. Following a PS, theevolution of nucleotides must furthermore be taken into account. The states at the

timed = 0.31 timed = 0.48

A

B

C

D E

A

B

C

D E

Fig. 27 Observations of secondary-structure on realization levelswhile evolutionary history isneglected can mislead assignments of families, illustrated by red and gray rectangles. Left: shortlyafter a speciation event, we have similarities based on the realization at the speciation event C andnot on the neighborhood system of Fig. 25 in the upper part. Right: after more time has elapsed,B and D have more structural similarities than either of themhas with E, although D and E aremore closely related in the phylogenetic tree. Thus, the phylogenetic history suggests that D and Eare a family, while according to the observed secondary structure realizations, both B and D (grayrectangle) and D and E (red rectangle) can form a class.

Page 47: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 47

internal nodes of the phylogeny are important because of thelikelihood of the stateremaining unchanged after only a short period of time. This depends on the model;it is an ancestral constraint that defines the influence of ancestral nucleotide distribu-tion at an alignment site and can be associated with observable ancestral correlationsin sequences. In general, term associations, correlationsor dependencies are used torepresent measurements from sequences via different estimation methods. If we es-timate correlations from homologous sequence data, e.g from an alignment, theyare related through their evolutionary history and common ancestral states. Thus,ancestral as well as functional correlations can occur. So far, structure predictionmethods have mostly been interested in predicting dependencies that result fromneighborhood constraints.

3.1.3 Phylogenetic background models for genomic screens

The consensus structure prediction programPfold [96] uses a 16×16 RNA base-pair substitution model combined with a context-free grammar similar to the gen-finder programsEvoFold [118] andQFold [125]. For assessing the significanceof predicted structures, e.g. estimating the false discovery rate in a genomic screenfor ncRNAs, the genomic predictions should be compared to the results obtainedfrom randomized data with the same dinucleotide content. Inthe case of singlesequences, well known and widley used algorithms are available for generating din-ucleotide controlled random sequences either by shuffling or by first Markov chainsimulations [1, 28]. One approach is to simulate the alignments of a given dinu-cleotide content [67].SISSIz is based on a complex substitution model that cap-tures the neighbor dependencies and other important alignment features except forthe signal in question. In addition, it directly combines the phylogenetic null modelwith theRNAalifold [81] consensus folding algorithm, thus yielding to a newvariant of a thermodynamic structure based RNA gene finding program that is notbiased by the dinucleotide content. For further details on ncRNA gene finding pro-grams we refer to chapter LATEX .

3.2 Controlled evolution and evolutionary design

Evolution under natural conditions is much too complex for modeling and the lackof detailed understanding at the molecular level is prohibitive for applications ofevolution to solve practical design. Controlled environmental conditions and reduc-tion of the complexity of the evolving unit, however, allow for the utilization ofevolutionary methods in biotechnology. We choose here three prominent examples:(i) applied molecular evolution, (ii) lethal mutagenesis in viral pathology, and (iii)controlled bacterial evolution.

Page 48: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

48 Tanja Gesell and Peter Schuster

3.2.1 Evolutionary design of molecules

Experimental [149] and theoretical studies [42] showed that evolution through mu-tation and selection is not bound to the existence of cellular life but occurs readilywith RNA molecules in test tube experiments (For details we refer to a recent reviewon RNA evolution in laboratory experiments [90]). The knownmechanistic detailsof in vitro evolution encouraged the development of techniques exploiting the evo-lutionary principle for the production of designed molecules with given functions.All these methods are based on sequence variation and selection, whereby variationhas become the easy part since the degree of sequence changescan be varied widelyfrom mutation with low mutation rates to random synthesis ofpolynucleotides. Se-lection for predefined function is the tricky part of directed evolution and commonlyrepresents a challenge for the intuition of the experimenter.

Selection by exponential enrichment (SELEX) has become a popular and rou-tinely used method [52, 163]. Targets for binding are attached to the stationary phaseof a chromatographic column and a solution containing a variety of RNA moleculesis poured through the column. Molecules with high affinity tobind to the target,calledaptamersare first retained at the column, then washed out by another solvent,and the whole procedure is repeated several times with mutated samples derivedfrom the best binders at this moment. Binding constants almost as large as thoseobserved with the strongest natural aggregates can be obtained after some twentyselection cycles [95].

Evolutionary design has been applied to a great variety of other problems too.A view examples of recent review articles are given here: (i)directed evolution ofnucleic acid enzymes [89], (ii) design of proteins by evolutionary methods [18, 86],and (iii) applications to design of low molecular weight compounds [176].

3.2.2 Virus evolution - error thresholds and lethal mutagenesis

Virus evolution is a broad field with many aspects. The idea torelate the propertiesof viruses to the structures viral RNA is almost 40 years old:In a pioneering pa-per Charles Weissmann [171] related the life cycle of the RNAbacteriophage Qβ tothe secondary structure of its RNA. The structural difference between newly synthe-sized and mature RNA is exploited as an important regulatoryelement. Mutations onthe RNA genotype may have direct consequences for both RNA structures and virallife cycles [34]. Here only the practical aspect of developing antiviral medication ismentioned. Many antiviral drugs are powerful because they increase the mutationrate and drive virus populations to extinction but for a satisfactory molecular expla-nation of the mechanism the required information on the fitness landscape is stillmissing. Nevertheless,lethal mutagenesisis an important phenomenon and simpli-fied models providing phenomenological explanations have been developed.14

14 An early paper [169] claimed that zero fitness values are incompatible with the existence ofquasispecies and error threshold. The result, however, turned out to be an artifact of a rather naıve

Page 49: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 49

0 0.1 0.2p0

4

8

12

16

mutation rate

num

ber

of le

thal positio

ns

d

uniform distribution

extinction

quasis

pecie

s

Fig. 28 Quasispecies and lethal mutations.The sketch shows the long time behavior of mutation-selection dynamics in presence of lethal mutations. For a sufficiently large number of lethal posi-tions (d < 8) in the virus genotype the quasispecies goes extinct at theextinction threshold (red)whereas for a smaller fraction of lethal variants two thresholds are observed: (i) an error threshold(blue) and an extinction threshold (red). The picture is redrawn from Tejero et al. [158], Fig.2d;choice of parameters: chain lengthl = 20; kinetic parameters:fm = 15, fk = 3, ϑ = 1.

Manfred Eigen and Esteban Domingo originally explained lethal mutagenesiscaused by pharmaceutical compounds increasing the mutation rate through drivingpopulations beyond the error threshold [35, 43]: At mutation rates above the errorthreshold replication becomesrandom15 and the result is a complete loss of thegenetic information and eventually the viral life cycle breaks down. Later on JamesBull and Claus Wilke studied the dynamics of lethal mutagenesis in more detail andclaimed that population extinction is a phenomenon independently of catastrophicerror accumulation [22, 23, 152]. Recently, Francisco Montero and coworkers [158]modeled lethal mutagenesis by means of three-species replication-mutation kineticswith neglect of back mutation

linear sequence space, since later works demonstrated thatselection and mutation on realistic se-quence spaces sustains error thresholds also in presence oflethal variants [155, 158].15 Random replicationexpresses the fact that error accumulation destroys the relation betweentemplate and copy and inheritance is no longer possible.

Page 50: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

50 Tanja Gesell and Peter Schuster

dcm

dt=(

fm(1− p)l −ϑ)

cm ,

dck

dt= fm(1− p)d(1− (1− p)(l−d)

)

cm+(

fk(1− p)d−ϑ)

ck , and

dcj

dt= fm

(

1− (1− p)d)cm+ fk(

1− (1− p)d)ck−ϑ c j ; c =n

∑i=1

ci ,

(24)

whereSm is the master sequence with the concentration[Sm] = cm and the repli-cation rate parameterfm, ck and fk refer to the class of non-lethal mutants, andc j

to the class of lethal mutants,ϑ is the uniform degradation rate parameter for allsequences,l is the chain length,d the number of positions at which mutation yieldsa lethal variant and, eventuallyp the single point mutation rate. This model [158]is characterized by two features: (i) The concentration of the material consumed inthe reproduction process is assumed to be constant,[A] = a0, anda0 is absorbed inthe fitness parameterfi (i = 1, . . . ,n), and (ii) a degradation rateϑ is introduced forall species. In contrast to the selection-mutation equation (15) and the flow reactordiscussed in section 4.2, the model system Equ.(24) does notapproach a station-ary state but the total concentrationc either grows infinitely or goes extinct. Fig.28illustrates the result concerning lethal mutagenesis. There are two different scenar-ios of quasispecies development with increasing mutation ratep, which depend onthe degree of lethality that is expressed in the number of lethal sitesd: (i) At lowlethality the quasispecies reaches first the error threshold at p = pcr, passes a rangeof p-values and then becomes extinct atp = pext, and (ii) at sufficiently high degreeof lethality the error threshold merges with the extinctionthreshold and the quasis-pecies dies out directly atp= pext. It is worth noticing that the stability of the quasis-pecies against mutation increases with increasing degree of lethality correspondingto a shift of the error threshold towards higher mutation rate. Lethal mutagenesis isunderstood at the phenomenological level but when it comes to molecular details,more experimental data and a comprehensive molecular theory is required. Studiesbased on more realistic landscapes including lethal variants into model landscapesin the sense of (20) and (22) are still missing.

3.2.3 What we learn from bacterial evolution

Controlled bacterial evolution has been and still is studied in a long time exper-iment by Richard Lenski. He and his groups started twelve parallel experimentsderived from a single clonalara−16 mutationara+ ↔ ara− is neutral [15]. strainof Escherichia coliin February 1988. The original clone underwent an immediaterevertant mutation toara+, and sixara+ and sixara− population were chosen forthe series experiment. Every day the cultures are propagated by transfer of sam-ples into new growth medium that has been intentionally chosen as poor in glucose,

16 The variantsara+ and ara− differ in a single point mutation and in the capacity to utilizearabinose as nutrient. In growth media free of arabinose the

Page 51: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 51

probes are taken, isolated and deep frozen at regular intervals of 500 generations[101, 102]. Until now the twelve populations have passed about 53 000 generations.Three findings are of direct relevance for this contribution: (i) the early adaptation tothe changed environment [51], (ii) the phylogeny of controlled bacterial evolution[116], and (iii) contingency and repeatability in evolution [15].

Early adaptation to new environmental conditions occurs insteps rather than con-tinuously [51] reminding of the course of structural optimization shown in Fig.22.Although optimization of the phenotype exhibits punctuated appearance, mutationsat the level of DNA sequences occur with a fairly constant rate per generation. Thepopulation evolves by forming clones that become separate in sequence space build-ing thereby phylogenetic trees [116]. After 31 500 generations one of the twelvepopulations produced a mutant, which had the capability forcitrate uptake from thegrowth medium [15].17 This clone had an instantaneous advantage and started togrow much faster and showed a dramatic increase in population size, because it hadconquered a hitherto unexploited niche with a new nutrient.The main question toask about contingency in evolution concerns the probability of the adaptive event:Was it (i) a highly improbable singular event or was it (ii) anordinary event ofcommon probability, which needed a preparation in the sensethat the clone had tomigrate into some region of sequence space before? In the first case there would beno chance to repeat the event, whereas in the second case the invention of a citratechannel should be repeatable if one started from some earlier isolate andran thetape a second time. Richard Lenski and coworkers could find an answer: Scenario(ii) is what happens in theEscherichia coliexperiments and indeed samples isolatedas early as generation 20 000, i.e. 11 500 generations beforethecit+ mutation hadhappened in the original population, were able to develop advantageous citrate vari-ants, whereas none was found with earlier isolates. The experiment is a beautifuldemonstration of contingency in controlled evolution: Migration of the populationin sequence space (in the sense of Fig.22) sets the stage for mutation events.

4 Notes

In our notes we focus on simulation programs. The concept ofinverse foldingasused by the programRNAinverse has already been described in Chapter LATEXwhich searches for sequences folding into a predefined structure. However, theseprograms do not take any phylogenetic relationship into account. The next sec-tion 4.1 deals with sequence evolution models under constraints along phylogenetictrees, including notes for both user and developer. In section 4.2 we introduce aphysical setup, the flow reactor, which is equally well suited for computer simula-tion and experimental implementation.

17 MostEscherichia colistrains are unable to live on citrate buffer because they have no mechanismfor uptake of citrate or citric acid into the cell. The growthmedium used by Lenskiet al. in thelong time evolutions experiment contained citrate buffer for pH control.

Page 52: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

52 Tanja Gesell and Peter Schuster

4.1 Phylogenetic simulation methods under constraint

Generating synthetic data is a significant task. Simulated data have to be generatedwith the same underlying parameters and statistics as the real data to which the toolwill eventually be applied. Parameter estimation is a topicin its own right and we re-fer to other literature as well as chapter LATEX in this book. Given a sequence evolu-tion model, there are at least two ways of simulating the substitution of nucleotidesalong phylogenetic trees. First, employing the matrix of substitution probabilitiesfor any time interval or second, using a rate matrix for an infinitesimally short timeinterval. The first approach requires the transition probability matrix, which is, forexample, calculated by numerical computation of the eigenvalues and eigenvectorsof the rate matrixQ (see section 2.1.1 and 2.3.4). If the number of substitutionislarge, the first probability matrix approach is faster sinceits computing time is inde-pendent of the number of substitutions. The second rate matrix approach, however,provides a way of simulating sequences under more complex models.

So far, different programs have been designed to simulate nucleotide sequencesand protein sequences along a tree [136, 121, 74, 181, 151, 113, 164, 97]. One ofthe most commonly used programsSeq-Gen [121] has implemented a wide rangeof independent nucleotide substitution models. The PHASE package [82] has im-plemented base-paired substitution models, but is specifically designed for RNAsequences with secondary structure without taking any energy parameters into ac-count. From a state-of-the-art perspective, the most important aspect is to be flexiblein terms of simplicity and complexity. We have to allow both general well-knownstructural constraints such as the mfe secondary structureand other constraints,e.g. from specific families, motifs or other RNA or Protein interactions. A methodthat focuses on a unifying framework for simulating sequence evolution with ar-bitrary complexity was therefore developed, implemented in the programSISSI(Simulating Site-Specific Interactions) [66]. Beside the model and the tree, the in-put file is a general neighborhood file for a user-defined neighborhood system oract-file. The neighborhood system can be transformed into another known struc-ture file such as a grammar or motifs file. The framework also allows us to define adifferent substitution matrix for each site. Several othersequence simulators includ-ing indels exit such asDAWG [25] andINDELible [59]. However, none of theseprograms takes site-specific interactions into account. A first algorithm for a simu-lation program including an indel process and site-specificinteractions is describedin [65].

By way of a last general note, the random generator, as with all simulations,should be chosen carefully. Furthermore, if a large number of simulations is run infast succession, it is highly recommended to improve the resolution of the randomnumber generator’s automatic seeding by adding some milliseconds to it. Alterna-tively, most programs offer the option of specifying a seed for the random numbergenerator. This is important for allowing the replication of results, e.g., while testingand debugging, or for repeating the simulation.

Page 53: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 53

Fig. 29 The flow reactor as a device for simulating deterministic and stochastic kinetics.Astock solution containing all materials for RNA replication ([A] = a0) including an RNA poly-merase flows continuously at a flow rater into a well stirred tank reactor (CSTR) and an equalvolume containing a fraction of the reaction mixture ([?] = {a,b,ci}) leaves the reactor (For dif-ferent experimental setups see Watts [170]). The population of RNA molecules in the reactor(S1,S2, . . . ,Sn present in the numbersN1,N2, . . . ,Nn with N = ∑n

i=1Ni ) fluctuates around a meanvalue,N±

√N. RNA molecules replicate and mutate in the reactor, and the fastest replicators are

selected. The RNA flow reactor has been used also as an appropriate model for computer simula-tions [60, 61, 84, 120]. There, other criteria for selectionthan fast replication can be applied. Forexample, fitness functions are defined that measure the distance to a predefined target structure andmean fitness increases during the approach towards the target [63].

4.2 The flow reactor and its applications

Models based on differential equations yield solutions also in case of marginal sta-bility. A famous example are the well known oscillations of the Lotka-Volterra sys-tem [109, pp.79-118]. Stochastic models commonly require aprecise physical setupand a well defined environment in order to yield stable and meaningful solutions.The flow reactor provides a defined and experimentally controllable environmentfor deterministic kinetics but it is also a suitable simulation tool for stochastic ap-proaches to chemical reactions based on thechemical master equation[64, 69, 70].Straightforward simulations are limited by the maximal population sizes (< 106)that can be handled in actual computations. These populations are too small formodeling chemical reactions and special techniques were developed that allow forseparation of deterministic and stochastic components [167, 168]. For many biolog-ical applications, however, the tractable population sizes are sufficient.

Page 54: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

54 Tanja Gesell and Peter Schuster

The sketch presented in Fig.29 shows an experimental setup that is suitable foranalysis and computer simulation for various experimentalimplementations aimingat studies of evolution in the laboratory see, e.g., [170]. Two assumptions are com-monly made: (i) The flow reactor is at thermal equilibrium with a controllable heatbath and (ii) the contents of the reactor is well-stirred in order to guarantee spatialhomogeneity. Non-equilibrium conditions are created by a flux of rater that regu-lates influx and provides the source for the material consumed in the reactor and anoutflux of reactor content to compensate for the change in volume. Thermodynamicequilibrium can be studied in the limit (r → 0, t → ∞). The reactor in the sketch(Fig.29) illustrates the optimization of RNA molecules through mutation and selec-tion as described in section 2.3.6 as well as Eqs.(14). Further details can be foundin the literature [61, 60, 84] and [120, pp.9-17].

5 Prospects of evolution

The present increase of molecular knowledge in the life sciences ranging frombiopolymers to whole organisms, populations, and ecosystems is phenomenal. Theadvances in technology allow nowadays for harvesting data in large quantities thatwere unaccessible twenty years ago. Whole genome sequencesas well as proteininteraction maps for whole cells are now readily available.Still there is a long wayto go from our present day data to a full understanding of cellular life. In particular,characterization of biomolecular structures and analysisof biochemical functionsare indispensable for the bottom-up approach in the sense ofsystems biology. Theimpressive work on the mini-bacteriumMycoplasma pneumoniae[76, 98, 183] hasdemonstrated the need for biochemical analysis very clearly. The wealth of new datarequires also a novel kind of theoretical biology [19], which sets the stage for mod-eling and computer simulation. As far as structure and phylogeny is concerned thenew developments suggest to extend the initially presentedparadigm of structuralbiology (Fig.2) by introduction of the evolutionary aspectof structure and functionbeing the target of selection, and the role of phylogeny thatcan be visualized as acoarse-grained mapping of evolution onto sequence space.

Phylogeny of RNA is particularly well suited for modeling evolution, because itcomprises fairly simple systems like evolutionin vitro of RNA molecules and allowsfor straightforward stepwise progression in complexity: molecules→ viroids →RNA viruses and retroviruses. Sufficient data for comprehensive models of viroid orRNA virus life cycles are not yet available but there are no real technical obstaclesfor harvesting them and we can expect a lot of progress in the near future. MostRNA viruses mutate with high rates and retrieving phylogenies within and betweenhosts is a hot topic in clinical studies and in epidemiology.Apart from cancer viralinfections are a field that is predestined for personal medicine since the spectrumof individual immunological responses to viral infectionsis rather broad. At presentthe role of RNA in the cell seems to be a never ending story adding more and more

Page 55: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 55

important features and regulatory tasks in genetics and epigenetics to this for longtime underestimated class of molecules.

Full understanding of the evolution of bacteria and eukaryotes on the molecularlevel is still a program for the future but we have learned from other disciplines thatnew techniques may change the situation completely in very short time – genomesequencing serves as the most spectacular example. The key towards such an un-derstanding is the genotype-phenotype mapping as encapsulated in the fitness land-scape. Fitness landscapes forin vitro evolution of RNA are available or at leastaccessible. The future challenge is to progress towards themore complex cases ofviroid and RNA virus evolution and even further to free living organisms.

Biologists familiar with the quite sophisticated tools in bioinformatics might ask,whether simple models like the ones presented and discussedhere can play a futurerole in the era of extensive computer simulations. We think,the answer is definitelyyes. Our small number of examples have been sufficient to demonstrate this andwe expect further progress through combining the fields of phylogenetic and evo-lutionary dynamics described in this chapter. Only sufficiently simple concepts andtheories can provide insights into complex systems and theydefine appropriate ref-erence states. The complexity of the real world is then introduced similarly as inphysics by means of intellectually comprehensible perturbations of the idealizedcases. Admittedly, simple reference models are very often not yet known, complexnetworks may serve as a familiar example, and their development is an importantfuture task for theorists in biology. Biology and chemistryare currently mergingand there is legitimate hope that the common strategies of physicists and chemistsbecome more popular in biology.

Acknowledgements The authors wish to express their gratitude towards CarolinKosiol for help-ful discussions. T.G is funded by a fellowship of the Austrian genome research program GEN-AUand the GEN-AU project ”Bioinformatics Integration Network III”. Financial support to the CIBIVinstitute from the the Wiener Wissenschafts-, Forschungs-and Technologiefonds (WWTF, to Arndtvon Haeseler) is greatly appreciated.

References

[1] S. F. Altschul and B. W. Erickson. Significance of nucleotide sequencealignments: a method for random sequence permutation that preserves dinu-cleotide and codon usage.Mol Biol Evol, 2(6):526–538, Nov 1985.

[2] S. Aris-Brosou and L. Excoffier. The impact of populationexpansion andmutation rate heterogeneity on DNA sequence polymorphism.Mol. Biol.Evol., 13:494–504, Mar 1996.

[3] P. F. Arndt, C. B. Burge, and T. Hwa. DNA sequence evolution with neighbor-dependent mutation.J.Comput.Biol., 10:313–322, 2003.

[4] F. J. Ayala. Vagaries of the molecular clock.Proc. Natl. Acad. Sci. USA,94:7776–7783, 1997.

Page 56: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

56 Tanja Gesell and Peter Schuster

[5] G. Baele, Y. Van de Peer, and S. Vansteelandt. A model-based approachto study nearest-neighbor influences reveals complex substitution patterns innon-coding sequences.Systematic Biol., 57:675–692, 2008.

[6] E. Bapteste, M. A. O’Malley, R. G. Beiko, M. Ereshevsky, J. P. Gogarten,L. Franklin-Hall, F. Lapointe, J. Dupre, T. Dagan, Y. Boucher, and W. Martin.Prokaryotic evolution and the tree of life are two differentthings. BiologyDirect, 4:34, 2009.

[7] J. Berard, J. B. Gouere, and D. Piau. Solvable models of neighbor-dependentsubstitution processes.Math. Biosci., 211:56–88, Jan 2008.

[8] C. K. Biebicher and R. Luce.In vitro recombination and terminal elongationof RNA by Qβ -replicase.The EMBO Journal, 11:5129–5135, 1992.

[9] C. K. Biebricher. Darwinian selection of self-replicating RNA molecules. InM. K. Hecht, B. Wallace, and G. T. Prance, editors,Evolutionary Biology,Vol. 16, pages 1–52. Plenum Publishing Corporation, 1983.

[10] C. K. Biebricher and M. Eigen. The error threshold.Virus Research,107:117–127, 2005.

[11] C. K. Biebricher, M. Eigen, and W. C. Gardiner, Jr. Kinetics of RNA replica-tion. Biochemistry, 22:2544–2559, 1983.

[12] C. K. Biebricher, M. Eigen, and W. C. Gardiner, Jr. Kinetics of RNA repli-cation: Plus-minus asymmetry and double-strand formation. Biochemistry,23:3186–3194, 1984.

[13] C. K. Biebricher, M. Eigen, and W. C. Gardiner, Jr. Kinetics of RNA repli-cation: Competition and selection among self-replicatingRNA species.Bio-chemistry, 24:6550–6560, 1985.

[14] C. K. Biebricher and W. C. Gardiner, Jr. Molecular evolution of RNA in vitro.Biophys. Chem., 66:179–192, 1997.

[15] Z. D. Blount, C. Z., and R. E. Lenski. Historical contingency an the evolu-tion of a key innovation in an experimental population ofescherichia coli.Proc. Natl. Acad. Sci. USA, 105:7898–7906, 2008.

[16] L. Boto. Horizontal gene transfer in evolution: Facts and challenges.Proc. Roy. Soc. B, 277:819–827, 2010.

[17] P. M. Boyle and P. A. Silver. Harnessing nature’s toolbox: Regulatory ele-ments for synthetic biology.J. Roy. Soc. Interface, 6:S535–S546, 2009.

[18] S. Brakmann and K. Johnsson, editors.Directed Molecular Evolution of Pro-teins or How to Improve Enzymes for Biocatalysis. Wiley-VCH, Weinheim,DE, 2002.

[19] S. Brenner. Theoretical biology in the third millenium.PhilTrans. Roy. Soc. London B, 354:1963–1965, 1999.

[20] L. Bromham. The genome as a life-history character: Whyrate of molec-ular evolution varies between mammal species.Proc. Trans. Roy. Soc. B,366:2503–2513, 2011.

[21] L. Bromham and D. Penny. The modern molecular clock.Nature ReviewsGenetics, 4:216–224, 2003.

[22] J. J. Bull, L. Ancel Myers, and M. Lachmann. Quasispecies made simple.PLoS Comp. Biol., 1:450–460, 2005.

Page 57: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 57

[23] J. J. Bull, R. Sanjuan, and C. O. Wilke. Theory for lethal mutagenesis forviruses.J. Virology, 81:2930–2939, 2007.

[24] M. Bulmer. Neighboring base effects on substitution rates in pseudogenes.Mol. Biol. Evol., 3:322–329, Jul 1986.

[25] R. A. Cartwright. DNA assembly with gaps (Dawg): simulating sequenceevolution.Bioinformatics, 21 Suppl 3:i31–38, Nov 2005.

[26] J. Cate, A. Gooding, E. Podell, K. Zhou, B. Golden, C. Kundrot, T. Cech,and J. Doudna. Crystal structure of a group I ribozyme domain: Principles ofRNA packing.Science, 273:2–33, 1996.

[27] O. F. Christensen, A. Hobolth, and J. L. Jensen. Pseudo-likelihood analysis ofcodon substitution models with neighbor-dependent rates.J. Comput. Biol.,12:1166–1182, Nov 2005.

[28] P. Clote. An efficient algorithm to compute the landscape of locally opti-mal RNA secondary structures with respect to the Nussinov-Jacobson energymodel.J. Comp. Biol., 12:83–101, 2005.

[29] M. Costanzo, A. Baryshnikova, J. Bellay, Y. Kim, E. D. Spear, C. S. Sevier,H. Ding, J. L. Y. Koh, K. Toufighi, S. Mostafavi, J. Prinz, R. P.S. Onge,B. Van der Sluis, T. Makhnevych, F. J. Vizeacoumar, S. Alizadeh, S. Bahr,R. L. Brost, Y. Chen, M. Cokol, R. Deshpande, Z. Li, Z.-Y. Li, W. Liang,M. Marback, J. Paw, B.-J. San Luis, E. Shuteriqi, A. H. Y. Tong, N. van Dyk,I. M. Wallace, J. A. Whitney, M. T. Weirauch, G. Zhong, H. Zhu,W. A.Houry, M. Brudno, S. Ragibizadeh, B. Papp, C. Pal, F. P. Roth, G. Giaver,C. Nislow, O. G. Troyanskaya, H. Bussey, G. D. Bader, A.-C. Gingras, Q. D.Morris, P. M. Kim, C. A. Kaiser, C. L.Myers, B. J. Andrews, andC. Boone.The genetic landscape of a cell.Science, 317:425–431, 2010.

[30] C. Darwin.The Origin of Species. Murray edition, London, 1859.[31] C. Delan-Forino, m. Maurel, and C. Torchet. Replication of avocado sun-

blotch viroid in the yeastsaccaromyces cerevisiae. J. Virology, 85:3229–3238, 2011.

[32] B. Ding and A. Itaya. Viroid: A useful model for studyingthe basic principlesof infection and RNA biology.Molecular Plant Microbe Interaction, 20:7–20, 2007.

[33] T. Dobzhansky. Nothing Make Sense Except in the Light ofEvolution.Am. Biol. Teach., 35:125–129, 1973.

[34] E. Domingo, C. R. Parrish, and H. John J, editors.Origin and Evolution ofViruses. Elsevier, Academic Press, Amsterdam, NL, second edition,2008.

[35] E. Domingo, ed. Virus entry into error catastrophe as a new antiviral strategy.Virus Research, 107(2):115–228, 2005.

[36] W. F. Doolittle. Phylogenetic classification and the unversal tree.Science,284:2124–2128, 1999.

[37] W. F. Doolittle. Uprooting the tree of life.Scientific American, 262(2):90–95,2000.

[38] W. F. Doolittle. The attempt on the life of the Tree of Life: Science, philoso-phy, and politics.Biol. Philos., 25:455–473, 2010.

Page 58: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

58 Tanja Gesell and Peter Schuster

[39] W. F. Doolittle and E. Bapteste. Pattern pluralism and the Tree of Life hy-pothesis.Proc. Natl. Acad. Sci. USA, 104:2043–2049, 2007.

[40] L. Duret and N. Galtier. The covariation between tpa deficiency, cpg defi-ciency, and g+c content of human isochores is due to a mathematical artifact.Mol Biol Evol, 17(11):1620–1625, Nov 2000.

[41] J. S. Edwards, R. U. Ibarra, and B. Ø. Palsson.In silico predictions ofes-cherichia colimetabolic capabilities are consistent with experimental data.Nature Biotechnology, 19:125–130, 2001.

[42] M. Eigen. Selforganization of matter and the evolutionof biological macro-molecules.Naturwissenschaften, 58:465–523, 1971.

[43] M. Eigen. Error catastrophe and antiviral strategy.Proc. Natl. Acad. Sci. USA,99:13374–13376, 2002.

[44] M. Eigen, J. McCaskill, and P. Schuster. Molecular quasispecies.J. Phys. Chem., 92:6881–6891, 1988.

[45] M. Eigen, J. McCaskill, and P. Schuster. The molecular quasispecies.Adv. Chem. Phys., 75:149–263, 1989.

[46] M. Eigen and P. Schuster. The hypercycle. A principle ofnatural self-organization. Part A: Emergence of the hypercycle.Naturwissenschaften,64:541–565, 1977.

[47] M. Eigen and P. Schuster. The hypercycle. A principle ofnatural self-organization. Part B: The abstract hypercycle.Naturwissenschaften, 65:7–41,1978.

[48] M. Eigen and P. Schuster. The hypercycle. A principle ofnatural self-organization. Part C: The realistic hypercycle.Naturwissenschaften, 65:341–369, 1978.

[49] M. Eigen and P. Schuster. Stages of emerging life - Five principles of earlyorganization.J. Mol. Evol., 19:47–61, 1982.

[50] M. Eigen and R. Winkler-Oswatitsch. Transfer-RNA: Theearly aadaptor.Naturwissenschaften, 68:217–228, 1981.

[51] S. F. Elena, V. S. Cooper, and R. E. Lenski. Punctiated evolution caused byselection of rare beneficial mutants.Science, 272:1802–1804, 1996.

[52] A. D. Ellington and J. W. Szostak.In vitro selection of RNA molecules thatbind specific ligands.Nature, 346:818–822, 1990.

[53] A. Fels, K. Hu, and D. Riesner. Transcription of potato spindle tuber viroidby RNA polymerase II starts predominantly at two specific sites. NucleicAcids Research, 29:4589–4597, 2001.

[54] J. Felsenstein. Evolutionary trees from DNA sequences: a maximum likeli-hood approach.J. Mol. Evol., 17:368–376, 1981.

[55] J. Felsenstein.Inferring Phylogenies. Sinauer Associates, Sunderland, Mas-sachusetts, 2004.

[56] J. Felsenstein and G. A. Churchill. A hidden markov model approach tovariation among sites in rate of evolution.J. Mol. Evol., 13:92–104, 1996.

[57] W. M. Fitch. Toward defining the course of evolution: Minimum change fora specific tree topology. .Syst. Zool, 20:406–416, Nov 1971.

Page 59: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 59

[58] C. Flamm, W. Fontana, I. L. Hofacker, and P. Schuster. RNA folding atelementary step resolution.RNA, 6:325–338, 2000.

[59] W. Fletcher and Z. Yang. INDELible: a flexible simulatorof biological se-quence evolution.Mol. Biol. Evol., 26:1879–1888, Aug 2009.

[60] W. Fontana, W. Schnabl, and P. Schuster. Physical aspects of evolutionaryoptimization and adaptation.Phys. Rev. A, 40:3301–3321, 1989.

[61] W. Fontana and P. Schuster. A computer model of evolutionary optimization.Biophys. Chem., 26:123–147, 1987.

[62] W. Fontana and P. Schuster. Shaping space: The possibleand the attainablein RNA genotype-phenotype mapping.J. Theor. Biol., 194:491–515, 1989.

[63] W. Fontana and P. Schuster. Continuity in evolution. Onthe nature of transi-tions. Science, 280:1451–1455, 1998.

[64] C. W. Gardiner. Stochastic Methods. A Handbook for the Natural and So-cial Sciences. Springer Series in Synergetics. Springer-Verlag, Berlin, fourthedition, 2009.

[65] T. Gesell.A Phylogenetic Definition of Structure. PhD thesis, University ofVienna, August 2009.

[66] T. Gesell and A. von Haeseler. In silico sequence evolution with site-specificinteractions along phylogenetic trees.Bioinformatics, 22:716–722, 2006.

[67] T. Gesell and S. Washietl. Dinucleotide controlled null models for compara-tive rna gene prediction.BMC Bioinformatics, 9:248–264, May 2008.

[68] D. T. Gillespie. A general method for numerically simulating the stochastictime evolution of coupled chemical reactions.J. Comp. Phys., 22:403–434,1976.

[69] D. T. Gillespie. A rigorous derivation of the chemical master equation.Phys-ica A, 188:404–425, 1992.

[70] D. T. Gillespie. Stochastic simulation of chemical kinetics.Annu. Rev. Phys. Chem., 58:35–55, 2007.

[71] N. Goldman and Z. Yang. A codon-based model of nucleotide substitutionfor protein-coding DNA sequences.Mol. Biol. Evol., 11:725–736, Sep 1994.

[72] P. A. Goloboff, S. A.Catalano, J. M. Mirande, C. A. Szumik, J. S. Arias,M. Kallersjo, and J. S. Farris. Phylogenetic analysis of 73 060 taxa corrobo-rates major eukayrotic groups.Cladistics, 25:211–230, 2009.

[73] S. Gottesman. The small RNA regulators ofescherichia coli: Roles andmechanisms.Annu. Rev. Microbiol., 58:303–328, 2004.

[74] N. Grassly, J. Adachi, and A. Rambaut. PSeq-Gen: An application for theMonte Carlo simulation of protein sequence evolution alongphylogenetictrees.Comput. Appl. Biosci., 13:559–560, 1997.

[75] P. E. Griffiths. In what sense does ’nothing make sense except in the light ofevolution’?Acta Biotheoretica, 57:11–32, 2009.

[76] M. Guell, V. van Noort, E. Yus, W.-H. Chen, J. Leigh-Bell, K. Michalodim-itrakis, T. Yamada, M. Arumugam, T. Doerks, S. Kuhner, M. Rode,M. Suyama, S. Schmidt, A.-C. Gavin, P. Bork, and L. Serrano. Transcrip-tome complexity in a genome-reduced bacterium.Science, 326:1268–1271,2009.

Page 60: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

60 Tanja Gesell and Peter Schuster

[77] D. L. Hartl and A. G. Clark. Principles of Population Genetics. SinauerAssociates, Sunderland, MA, 1998.

[78] M. Hasegawa, H. Kishino, and T. Yano. Dating of the human–ape splitting bya molecular clock of mitochondrial DNA.J. Mol. Evol., 22:160–174, 1985.

[79] J. Hein, M. H. Schierup, and C. Wiuf.Gene Genealogies, Variation andEvolution: A Primer in Coalescent Theory. Oxford University Press, NewYork, 2005.

[80] S. Y. W. Ho, R. Lanfear, L. Bromham, M. J. Phillips, J. Soubrier, A. G. Ro-drigo, and A. Cooper. Time-dependent rates of molecular evolution. Molec-ular Ecology, 20:3087–3101, 2011.

[81] I. L. Hofacker, M. Fekete, and P. F. Stadler. Secondary structure predictionfor aligned RNA sequences.J. Mol. Biol., 319:1059–1066, 2002.

[82] C. Hudelot, H. Gowri-Shankar, M. Rattray, and P. Higgs.RNA-based phylo-genetic methods: Application to mammalian mitochondrial RNA sequences.Mol. Phylogenet. Evol., 2003.

[83] D. H. Huson, R. Rupp, and C. Scornavacca.Phylogenetic Networks. Cam-bridge University Press, Cambridge, 2010.

[84] M. A. Huynen, P. F. Stadler, and W. Fontana. Smoothness within ruggedness.The role of neutrality in adaptation.Proc. Natl. Acad. Sci. USA, 93:397–401,1996.

[85] H. Innan and W. Stephan. Selection intensity against deleterious mutations inRNA secondary structures and rate of compensatory nucleotide substitutions.Genetics, 159:389–399, Sep 2001.

[86] C. Jackel, P. Kast, and D. Hilvert. Protein design by directed evolution.Annu. Rev. Biophys., 37:153–173, 2008.

[87] J. L. Jensen and A.-M. K. Pedersen. Probabilistic models of DNA sequenceevolution with context dependent rates of substitution.Adv. Appl. Prob.,32:449–467, 2000. pages in jstore: 499-517.

[88] B. L. Jones, R. H. Enns, and S. S. Rangnekar. On the theoryof selection ofcoupled macromolecular systems.Bull. Math. Biol., 38:15–28, 1976.

[89] G. F. Joyce. Directed evolution of nucleic acid enzymes.Annu. Rev. Biochem., 73:791–836, 2004.

[90] G. F. Joyce. Forty years ofin vitro evolution. Angew. Chem. Internat. Ed.,46:6420–6436, 2007.

[91] T. H. Jukes and C. R. Cantor. Evolution of protein molecules. In H. N. Munro,editor,Mammalian Protein Metabolism, volume 3, pages 21–123. AcademicPress, New York, 1969.

[92] A. Keller, F. Forster, T. Muller, T. Dandekar, J. Schultz, and M. Wolf. In-cluding RNA secondary structures improves accuracy and robustness in re-construction of phylogenetic trees.Biol. Direct, 5:4, 2010.

[93] M. Kimura. A simple method for estimating evolutionaryrates of base sub-stitutions through comparative studies of nucleotide sequences.J. Mol. Evol.,16:111–120, Dec 1980.

[94] M. Kimura. The Neutral Theory of Molecular Evolution. Cambridge Uni-veresity Press, Cambridge, UK, 1983.

Page 61: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 61

[95] S. Klussmann, editor.The Aptamer Handbook. Functional Oligonucleotidesand Their Applications. Wiley-VCh Verlag, Weinheim, DE, 2006.

[96] B. Knudsen and J. Hein. Pfold: Rna secondary structure prediction usingstochastic context-free grammars.Nucleic Acids Res, 31(13):3423–8, Jul 12003.

[97] S. L. Kosakovsky Pond, S. D. W. Frost, and S. Muse. HyPhy:hypothesistesting using phylogenies.Bioinformatics, 21:676–679, 2005.

[98] S. Kuhner, V. van Noort, M. J. Betts, A. Leo-Macias, C. Batisse, M. Rode,T. Yamada, T. Maier, S. Bader, P. Beltran-Alvarez, D. Casta˜no-Diez, W.-H.Chen, D. Devos, M. Guell, T. Norambuena, I. Racke, V. Rybin,A. Schmidt,E. Yus, R. Aebersold, R. Herrmann, B. Bottcher, A. S. Frangakis, R. B.Russell, L. Serrano, P. Bork, and A.-C. Gavin. Proteome organization in agenome-reduced bacterium.Science, 326:1235–1240, 2009.

[99] S. Kumar. Molecular clocks: Four decades of evolution.Nature ReviewsGenetics, 6:654–662, 2005.

[100] C. G. Kurland, B. Canback, and O. G. Berg. Horizontal gene transfer: acritical view. Proc. Natl. Acad. Sci. U.S.A., 100:9658–9662, Aug 2003.

[101] R. E. Lenski, M. R. Rose, S. C. Simpson, and S. C. Tadler.Long-term ex-perimental evolution inescherichia coli. I. Adaptation and divergence during2,000 generations.The American Naturalist, 38:1315–1341, 1991.

[102] R. E. Lenski and M. Travisano. Dynamics of adaptation and di-versification: A 10,000-generation experiment with bacterial populations.Proc. Natl. Acad. Sci. USA, 91:6808–6814, 1994.

[103] M. Mandal, B. Boese, J. E. Barrick, W. C. Winkler, and R.R. Breaker. Ri-boswitches control fundamental biochemical pathways inbacillus subtilisand other bacteria.Cell, 113:577–596, 2003.

[104] M. Mann and K. Klemm. Efficient exploration of discreteeenergy land-scapes.Phys. Rev. E, 83:011113, 2011.

[105] J. Maynard Smith.Evolutionary Genetics. Oxford University Press, Oxford,UK, second edition, 1998.

[106] S. Meyer and A. von Haeseler. Identifying site-specific substitution rates.Mol. Biol. Evol., 20:182–189, Feb 2003.

[107] G. J. Morgan. Emile Zuckerkandl, Linus Pauling and themolecular evolu-tionary clock.J. History of Biology, 31:155–178, 1998.

[108] B. R. Morton. Neighboring base composition and transversion/transition biasin a comparison of rice and maize chloroplast noncoding regions.Proc. Natl.Acad. Sci. U.S.A., 92:9717–9721, Oct 1995.

[109] J. D. Murray. Mathematical Biology I: An Introduction. Springer-Verlag,New York, third edition, 2002.

[110] S. V. Muse. Evolutionary analyses of DNA sequences subject to constraintson secondary structure.Genetics, 139:1429–1439, 1995.

[111] S. V. Muse and B. S. Gaut. A likelihood approach for comparing synony-mous and nonsynonymous nucleotide substitution rates, with application tothe chloroplast genome.Mol. Biol. Evol., 11:715–724, Sep 1994.

Page 62: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

62 Tanja Gesell and Peter Schuster

[112] M. Nei and S. Kumar.Molecular Evolution and Phylogenetics. Oxford Uni-versity Press, New York, 2000.

[113] J. S. Nicholas, D. C. Hoyle, and P. G. Higgs. RNA sequence evolution withsecondary structure constraints: Comparison of substitution rate models us-ing maximum-likelihood methods.Genetics, 157:399–411, 2000.

[114] J. D. Orth, I. Thiele, and B. Ø. Palsson. What is flux balance analysis?NatureBiotechnology, 28:245–248, 2010.

[115] N. R. Pace. Mapping the tree of life: Progress and prospects. Micro-biol. Mol. Biol. Rev., 73:565–576, 2009.

[116] D. Papadopoulos, D. Schneider, J. Meies-Eiss, W. Arber, R. E. Lenski, andM. Blot. Genomic evolution during a 10,000-generation experiment withbacteria.Proc. Natl. Acad. Sci. USA, 96:3807–3812, 1999.

[117] A.-M. Pedersen and J. Jensen. A dependent rates model and MCMC basedmethodology for the maximum likelihood analysis of sequences with over-lapping reading frames.Mol. Biol. Evol., 18:763–776, 2001.

[118] J. S. Pedersen, G. Bejerano, A. Siepel, K. Rosenbloom,K. Lindblad-Toh,E. S. Lander, J. Kent, W. Miller, and D. Haussler. Identification and classi-fication of conserved rna secondary structures in the human genome.PLoSComput Biol, 2(4):e33, 04 2006.

[119] L. Peliti and B. Derrida. Evolution in a flat fitness landscape.Bull. Math. Biol., 53:355–382, 1991.

[120] P. E. Phillipson and P. Schuster.Modeling by Nonlinear Differential Equa-tions. Dissipative and Conservative Processes, volume 69 ofWorld ScientificSeries on Nonlinear Science A. World Scientific, Singapore, 2009.

[121] A. Rambaut and N. C. Grassly. Seq-Gen: An application for the Monte Carlosimulation of DNA sequence evolution along phylogenetic trees. Comput.Appl. Biosci., 13:235–238, 1997.

[122] B. Rannala and Z. Yang. Probability distribution of molecular evolutionarytrees: a new method of phylogenetic inference.J. Mol. Evol., 43:304–311,Sep 1996.

[123] M. D. Rasmussen and M. Kellis. Accurate gene-tree reconstruction by learn-ing gene- and species-specific substitution rates across multiple completegenomes.Genome Res., 17:1932–1942, Dec 2007.

[124] J. S. Reader and G. F. Joyce. A ribozyme composed of onlytwo differentnucleotides.Nature, 420:841–844, 2002.

[125] E. Rivas and S. R. Eddy. Noncoding RNA gene detection using comparativesequence analysis.BMC Bioinformatics, 2:8–8, 2001.

[126] D. M. Robinson, D. T. Jones, H. Kishino, N. Goldman, andJ. L. Thorne.Protein evolution with dependence among codons due to tertiary structure.Mol. Biol. Evol., 20:1692–1704, 2003.

[127] F. Rodriguez, J. L. Oliver, A. Main, and J. R. Medina. The general stochasticmodel of nucleotide substitution.J. Theor. Biol., 142:485–501, 1990.

[128] A. J. Roger and L. A. Hug. The origin and diversificationof eukary-otes: Problems with molecular phylogenies and molecular clock estimation.Proc. Trans. Roy. Soc. B, 361:1039–1054, 2006.

Page 63: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 63

[129] A. Rzhetsky. Estimating substitution rates in ribosomal RNA genes.Genet-ics, 141:771–783, 1995.

[130] D. B. Saakian, C. K. Biebricher, and C.-K. Hu. Phase diagram for theEigen quasispecies theory with a trancated fitness landscape. Phys. Rev. E,79:041905, 2009.

[131] N. Saitou and M. Nei. The neighbor–joining method: A new method forreconstructing phylogenetic trees.Mol. Biol. Evol., 4:406–425, 1987.

[132] Y. Satta and N. Takahata. DNA Archives and Our Nearest Relative: TheTrichotomy Problem Revisited.Molecular Phylogenetics and Evolution,14(2):259–275, 2000.

[133] N. J. Savill, D. C. Hoyle, and P. G. Higgs. RNA sequence evolution with sec-ondary structure constraints: Comparison of substitutionrate models usingmaximum-likelihood methods.Genetics, 157:399–411, 2000.

[134] M. Schoniger and A. von Haeseler. A stochastic model for the evolution ofautocorrelated DNA sequences.Mol. Phylogenet. Evol., 3:240–247, 1994.

[135] M. Schoniger and A. von Haeseler. Performance of the maximum likelihood,neighbor joining, and maximum parsimony methods when sequence site arenot independent.Systematic Biol., 44:533–547, 1995.

[136] M. Schoniger and A. von Haeseler. Simulating efficiently the evolution ofDNA sequences.Comput. Appl. Biosci., 11:111–115, 1995.

[137] P. Schuster. Molecular insight into the evolution of phenotypes. In J. P.Crutchfield and P. Schuster, editors,Evolutionary Dynamics – Exploring theInterplay of Accident, Selection, Neutrality, and Function, pages 163–215.Oxford University Press, New York, 2003.

[138] P. Schuster. Prediction of RNA secondary structures:From theory to modelsand real molecules.Reports on Progress in Physics, 69:1419–1477, 2006.

[139] P. Schuster. Mathematical modeling of evolution. Solved and open problems.Theory in Biosciences, 130:71–89, 2011.

[140] P. Schuster, W. Fontana, P. F. Stadler, and I. L. Hofacker. From se-quences to shapes and back: A case study in RNA secondary structures.Proc. R. Soc. Lond. B, 255:279–284, 1994.

[141] P. Schuster and J. Swetina. Stationary mutant distribution and evolutionaryoptimization.Bull. Math. Biol., 50:635–660, 1988.

[142] C. Semple and M. Semple.Phylogenetics. Oxford Lectures Series in Mathe-matics and its Applications. Oxford University Press, 2003.

[143] E. Seneta.Non-negative Matrices and Markov Chains. Springer-Verlag, NewYork, second edition, 1981.

[144] A. Serganov and D. J. Patel. Ribozymes, riboswitches and beyond: Regula-tion of gene expression without proteins.Nature Reviews Genetics, 8:776–790, 2007.

[145] A. Siepel and D. Haussler. Phylogenetic estimation ofcontext-dependentsubstitution rates by maximum likelihood.Mol. Biol. Evol., 21:468–488,2004.

[146] R. D. Sleator. Phylogenetics.Arch. Microbiol., 193:235–239, 2011.

Page 64: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

64 Tanja Gesell and Peter Schuster

[147] S. Smit, J. Widmann, and R. Knight. Evolutionary ratesvary among rRNAstructural elements.Nucleic Acids Res., 35:3339–3354, 2007.

[148] A. D. Smith, T. W. H. Lui, and E. R. M. Tillier. Empiricalmodels for substi-tution in ribosomal RNA.Mol. Biol. Evol., 21:419–427, 2004.

[149] S. Spiegelman. An approach to the experimental analysis of precellular evo-lution. Quart. Rev. Biophys., 4:213–253, 1971.

[150] W. Stephan. The rate of compensatory evolution.Genetics, 144:419–426,Sep 1996.

[151] J. Stoye, D. Evers, and F. Meyer. Rose: generating sequence families.Bioin-formatics, 14:157–163, 1998.

[152] J. Summers and S. Litwin. Examining the theory of errorcatastrophe.J. Virology, 80:20–26, 2006.

[153] J. Swetina and P. Schuster. Self-replication with errors - A model for polynu-cleotide replication.Biophys. Chem., 16:329–345, 1982.

[154] N. Takahata. Molecular clock: Ananti-neo-Darwinian legacy.Genetics,176:1–6, 2007.

[155] N. Takeuchi and P. Hogeweg. Error-thresholds exist infitness landscapeswith lethal mutants.BMC Evolutionary Biology, 7:15, 2007.

[156] K. Tamura and M. Nei. Estimation of the number of nucleotide substitutionsin the control region of mitochondrial DNA in humans and chimpanzees.Mol. Biol. Evol., 10:512–526, 1993.

[157] S. Tavare. Some probabilistic and statistical problems on the analysis of DNAsequences.Lec. Math. Life Sci., 17:57–86, 1986.

[158] H. Tejero, A. Marın, and F. Moran. Effect of lethalityon the extinction andon the error threshold of quasispecies.J. Theor. Biol., 262:733–741, 2010.

[159] C. J. Thompson and J. L. McBride. On Eigen’s theory of the self-organizationof matter and the evolution of biological macromolecules.Math. Biosci.,21:127–142, 1974.

[160] E. R. M. Tillier. Maximum likelihood with multi-parameter models of sub-stitution. J. Mol. Evol., 39:409–417, 1994.

[161] E. R. M. Tillier and R. A. Collins. Neighbor joining andmaximum likelihoodwith RNA sequences: Addressing the interdependence of sites. Mol. Biol.Evol., 12:7–15, 1995.

[162] E. R. M. Tillier and R. A. Collins. High apparent rate ofsimultaneous com-pensatory base-pair substitutions in ribosomal RNA.Genetics., 148:1993–2002, 1998.

[163] C. Tuerk and L. Gold. Systematic evolution of ligands by exponential enrich-ment: RNA ligands to bacteriophage T4 DNA polymerase.Science, 249:505–510, 1990.

[164] P. Tuffery. CS-PSeq-Gen: Simulating the evolution of protein sequence underconstraints.Bioinformatics, 18:1015–1016, 2002.

[165] T. Uzzell and K. W. Corbin. Fitting Discrete Probability Distributions toEvolutionary Events.Science, 172(3988):1089–1096, 1971.

Page 65: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

Phylogeny and evolution of structure 65

[166] Y. Van de Peer, J. M. Neefs, P. De Rijk, and R. De Wachter.Reconstructingevolution from eukaryotic small-ribosomal-subunit RNA sequences: calibra-tion of the molecular clock.J. Mol. Evol., 37:221–232, Aug 1993.

[167] N. G. van Kampen. A power series expansion of the masterequation.Can. Chem. Phys., 39:551–567, 1961.

[168] N. G. van Kampen. The expansion of the master equation.Adv. Chem. Phys.,34:245–309, 1976.

[169] G. P. Wagner and P. Krall. What is the difference between models of errorthresholds and Muller’s ratchet.J. Math. Biol., 32:33–44, 1993.

[170] A. Watts and G. Schwarz, editors.Evolutionary Biotechnology – From The-ory to Experiment, volume 66/2-3 ofBiophysical Chemistry, pages 67–284.Elesvier, Amsterdam, 1997.

[171] C. Weissmann. The making of a phage.FEBS Letters, 40:S10–S18, 1974.[172] T. Wiehe. Model dependency of error thresholds: The role of fitness

functions and contrasts between the finite and the infinite sites models.Genet. Res. Camb., 69:127–136, 1997.

[173] W. C. Winkler. Metabolic monitoring by bacterial mRNAs. Arch. Microbiol.,183:151–159, 2005.

[174] C. Woese. The universal ancestor.Proc. Natl. Acad. Sci. USA, 95:6854–6859,1998.

[175] M. T. Wolfinger, W. A. Svrcek-Seiler, C. Flamm, I. L. Hofacker, and P. F.Stadler. Efficient computation of RNA folding dynamics.J. Phys. A: Mathe-matical and General, 37:4731–4741, 2004.

[176] S. J. Wrenn and P. B. Harbury. Chemical evolution as a tool for moleculardiscovery.Annu. Rev. Biochem., 76:331–349, 2007.

[177] S. Wright. The roles of mutation, inbreeding, crossbreeding and selection inevolution. In D. F. Jones, editor,Int. Proceedings of the Sixth InternationalCongress on Genetics, volume 1, pages 356–366, Ithaca, NY, 1932. BrooklynBotanic Garden.

[178] Z. Yang. Maximum-likelihoodestimation of phylogenyfrom DNA sequenceswhen substitution rates differ over sites.J. Mol. Evol., 42:587–596, 1993.

[179] Z. Yang. Maximum likelihood phylogenetic estimationfrom DNA sequenceswith variable rates over sites: Approximative methods.J. Mol. Evol., 39:306–314, 1994.

[180] Z. Yang. Maximum-likelihood models for combined analyses of multiplesequences data.J. Mol. Evol., 42:587–596, 1996.

[181] Z. Yang. PAML: a program package for phylogenetic analysis by maximumlikelihood. Computer Applications in BioSciences, 13:555–556, 1997.

[182] J. Yu and J. L. Thorne. Dependence among sites in RNA evolution.Mol. Biol. Evol., 23:1525–1537, 2006.

[183] E. Yus, T. Maier, K. Michalodimitrakis, V. van Noort, T. Yamada, W.-H. Chen, J. A. Wodke, M. Guell, S. Martınez, R. Bourgeois, S. Kuhner,E. Raineri, I. Letunic, O. V. Kalinina, M. Rode, R. Herrmann,R. Gutierez-Gallego, R. B. Russell, A.-C. Gavin, P. Bork, and L. Serrano.Impact

Page 66: Phylogeny and evolution of structure - UGRbioinfo2.ugr.es/.../PhylogenyAndEvolutionOfStrcture_Schuster2012.pdfPhylogeny and evolution of structure ... euphoric introduction to evolution

66 Tanja Gesell and Peter Schuster

of genome reduction on bacterial metabolism and its regulation. Science,326:1263–1268, 2009.

[184] X. Zhong, A. J. Archual, A. A. Amin, and B. Ding. A genomic map of viroidRNA motifs critical for replication and system trafficking.The Plant Cell,30:35–47, 2008.

[185] X. Zhong, N. Leontis, S. Qian, A. Itaya, Y. Qi, K. Boris-Lawrie, and B. Ding.Tertiary structural and functional analyses of a viroid RNAmotif s criticalfor replication and system trafficking.The Plant Cell, 30:35–47, 2008.