Download - Concurrent origins of the genetic code and the …...2015/04/13 · 0 Overview The origins of the genetic code, the homochirality of life and the three domains of life are the critical

Concurrent origins of the genetic code and the homochiralityof life, and the origin and evolution of biodiversity.

Part II: Technical appendix

Dirson Jian LiDepartment of Applied Physics, Xi’an Jiaotong University, Xi’an 710049, China

Abstract

Here is Part II of my two-part series paper, which provides technical details and evidence for Part Iof this paper (see “Concurrent origins of the genetic code and the homochirality of life, and the originand evolution of biodiversity. Part I: Observations and explanations” on bioRxiv).

Contents0 Overview 1

1 The roadmap for the evolution of the genetic code and the origin of homochirality of life 31.1 Initiation of the roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.2 Expansion of the genetic code along the roadmap . . . . . . . . . . . . . . . . . . . . . . . 41.3 The ending of the roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71.4 Origin of tRNAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81.5 Codon degeneracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101.6 Origin of homochirality of life . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141.7 Evidence for the roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2 Origin of the three domains 182.1 Genomic codon distributions reveal the relationship between the evolution of the genetic code

and the tree of life . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182.2 Features of genomic codon distributions: observations and simulations . . . . . . . . . . . . 202.3 The evolution space and the tree of life . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3 Reconstruction of the Phanerozoic biodiversity curve 253.1 Explanation of the trend of the Phanerozoic biodiversity curve based on genome size evolution 253.2 Explanation of the variation of the Phanerozoic biodiversity curve based on climatic and

eustatic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283.3 Reconstruction of the Phanerozoic biodiversity curve and the strategy of the evolution of life 34

References 36

Index 41

Supplementary Figures 43

certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint

https://doi.org/10.1101/017988

0 Overview

The origins of the genetic code, the homochirality of life and the three domains of life are the critical eventsat the beginning of life. Most direct evidence for these primordial events has been erased; we have to turn tocircumstantial evidence in the genome sequences to reproduce the early history of life. In Part I, a roadmaptheory is proposed to explain the above events. It indicates that the genetic code and the homochiralityof life originated concurrently, which consequently determined the origin and evolution of the biodiversity.Evidence is the final arbiter for this theory. Here in Part II, based on biological and geological data, plenty ofevidence has been provided in detail from different perspectives.

Section 1 explains the roadmap for the evolution of the genetic code and the origin of the homochiralityof life (Fig 1, 2), which includes initiation, expansion and the ending of the roadmap (Fig 1A, 1B, 2B); theorigin of tRNAs (Fig 2A); cooperative recruitments of codons and amino acids (Fig 2B); roadmap symmetryand codon degeneracy (Fig 2C); chiral symmetry broken at the origin of life (Fig 2D); and a summary ofevidences that support the roadmap. Section 2 explains the origin of the three domains of life (Fig 3), whichincludes the relationship between the evolution of the genetic code and the tree of life based on the genomiccodon distributions (Fig 3A, S3a5); the origin order of Virus, Bacteria, Archaea and Eukarya according tothe recruitment order of codons (Fig 3C); construction of the evolution space; and construction of the treeof life based on the roadmap and genomic codon distributions (Fig 3D, S3d9). And Section 3 explainsthe Phanerozoic biodiversity curve qualitatively and quantitatively (Fig 4), which includes the split andreconstruction scheme for the Phanerozoic biodiversity curve (Fig 4A, 4B, 4C); the evolution of genomesize (Fig S4a2, S4a7); the Phanerozoic climatic curve (Fig S4b2); the Phanerozoic eustatic curve (Fig S4b3);tectonic cause of mass extinctions (Fig 4B); and explanation of the declining origination rate and extinctionrate (Fig 4D). Some new nomenclatures are defined when necessary in this paper, which are listed in theIndex.

There are 14 subfigures plus 53 supplementary subfigures to explain the origin and evolution of life basedon biological and geological data. Numbering rules are as follows: Appendix Section 1-3 correspond toSection 1-3 in the report respectively, and the supplementary subfigures Fig Sαβγ (α = 1, 2, 3, 4; β = a, b, c,d; γ = 1, 2, 3, ...) correspond to subfigures Fig αβ (α = 1, 2, 3, 4; β = A, B, C, D; with corresponding ‘αβ’)in the report respectively.

Section 1 (Appendix Section 1) : Fig 1, 2 (Fig S1, S2)

Fig 1A (Fig S1a1) , Fig 1B,

Fig 2A (Fig S2a1-3) , Fig 2B (Fig S2b1-5),

Fig 2C (Fig S2c1-3) , Fig 2D (Fig S2d1);

Section 2 (Appendix Section 2) : Fig 3 (Fig S3)


Fig 3C (Fig S3c1-2) , Fig 3D (Fig S3d1-10);

1


https://doi.org/10.1101/017988

Section 3 (Appendix Section 3) : Fig 4 (Fig S4)


Fig 4C (Fig S4c1) , Fig 4D (Fig S4d1-3).

2


https://doi.org/10.1101/017988

1 The roadmap for the evolution of the genetic code and the origin ofhomochirality of life

1.1 Initiation of the roadmap

The evolution of the genetic code can be divided into three stages: the initiation stage, the expansion stageand the ending stage.

In the beginning, there was an R (R denotes purine) single-stranded DNA Poly G (Fig 1A, 1B #1). Bycomplementary base pairing, a YR (Y denotes pyrimidine) double-stranded DNA Poly C · Poly G formed.And by base triplex CG ∗ G, a YR ∗ R1 triple-stranded DNA Poly C · Poly G ∗ Poly G formed (Fig 1A, 1B#1). The third R1 strand Poly G separated out of this YR ∗R1 triple-stranded DNA, which then formed a newY1R1 double-stranded DNA Poly C · Poly G. So far, there was only initial codon pair GGG · CCC (Fig 1A,1B #1).

Starting from this initial #1 codon pair and by the base substitutions throughout the four hierarchies, thebase pairs from #2 to #32 recruited so that the whole genetic code table is fulfilled (Fig 1). In this paper, thetechnical process is illustrated in the roadmap for the evolution of the genetic code (or the roadmap for short)(Fig 1A). The relative stabilities of all the 16 possible base triplexes are listed from weak to strong as follows[1] [2]:

(−) GC ∗ A, AT ∗C, AT ∗ A

(+) CG ∗G, T A ∗C, T A ∗ A, T A ∗G, GC ∗C, GC ∗G, AT ∗ T

(++) CG ∗ A, CG ∗ T, GC ∗ T

(3+) AT ∗G

(4+) CG ∗C, T A ∗ T

The relative stability of base triplexes played a crucial role in the base substitutions along the roadmap.

A weaker base triplex tends to be substituted spontaneously by a stronger one, in the course of repeatedcombinations and separations of strands for the triple-stranded DNA. Only three kinds of substitutions ofbase triplexes are practically required in the roadmap: substitution of (+) CG ∗ G by (++) CG ∗ A, with thetransition from G to A in the third R strand; substitution of (+) CG ∗G by (4+) CG ∗C, with the transversionfrom G to C in the third R strand; substitution of (+) GC ∗ C by (++) GC ∗ T , with the transition from C toT in the third R strand (Fig 1A, Sla1). The transition from G to A is of the most common substitution in theroadmap (Fig 1A). All the 8 codon pairs in Route 0 were formed by the transition from G to A. Consideringthe dislocation of bases in Poly C ·Poly G∗Poly G, only one transversion from G to C was enough to blazed anew path other than Route 0 by bringing the starting codon pairs GGC ·GCC, GCG ·CGC and CGG ·CCG forRoute 1, Route 2 and Route 3 at positions #2, #7 and #10 in the roadmap, respectively (Fig 1A). Consideringthe dislocation of bases similarly, only one transition from C to T was enough to blaze a new path so as to

3


https://doi.org/10.1101/017988

form the remaining codon pairs by substitutions at positions #6, #19 and #12 in the roadmap, respectively (Fig1A). The length of the genetic code are not greater than 3 bases because such supposed genetic code cannotbe fulfilled by the demand of increasing stability of the base triplexes; namely the unavoidable instabilities ofthe base triplexes prohibited the k-base genetic code (k , 3) in practice.

In the initiation stage of the roadmap, the codon pairs from #1 to #6 were recruited by the roadmap, whichconstituted the initial subset of the genetic code:

#1 GGG ·CCC, #2 GGC ·GCC, #3 GGA ·UCC, #4 GAG ·CUC, #5 GAC ·GUC, #6 GGU ·ACC.

And in this stage were recruited the earliest 9 amino acids in order: 1Gly, 2Ala, 3Glu, 4Asp, 5Val, 6Pro,7S er, 8Leu, 9Thr, all of which belong to Phase I amino acids [3] [4]. Although the initial subset is small, thetwo essential features of the roadmap, pair connection and route duality, had taken shape in this initial stage(Fig 1A, 2B).

Pair connection is an essential feature of the roadmap. The connected codon pairs in the roadmapgenerally encode a common amino acid (Fig 1A, 2C). For instance, the pair connection #1−Gly−#2 indicatesthat both GGG in #1 and GGC in #2 encode the common amino acid Gly. Pair connections reveal the closerelationship between the recruitment of the codons and the recruitment of the amino acids.

Route duality is another essential feature of the roadmap, which shows the relationship of pair connectionsbetween different routes (Fig 1A, 2C). For instance, the route duality

#1 −Gly − #3 ∼ #2 −Gly − #6

indicates that the pair connection #1 −Gly − #3 in Route 0 and the pair connection #2 −Gly − #6 in Route 1are dual, which encode the common amino acid Gly. Route dualities generally exist between Route 0 andRoute 3, or between Route 1 and Route 2 (Fig 2C).

1.2 Expansion of the genetic code along the roadmap

The genetic codes evolved along the four routes Route 0 − 3 respectively; and the 8 codon pairs in eachroute evolved in the order of the four hierarchies Hierarchy 1 − 4 respectively (Fig 1A). The roadmap can bedivided into two groups: the early hierarchies Hierarchy 1 − 2 and the late hierarchies Hierarchy 3 − 4. Itcan also be divided into two groups: the initial route Route 0 (all-purine codons with all-pyrimidine codons)and the expanded routes Route 1 − 3 (Fig 1A, S3d2). These groupings will be used to explain the origin ofthe three domains in Subsection 2.3.

The genetic codes expanded spontaneously from the initial subset in the expansion stage of the roadmap.Each of the 6 codon pairs in the initial subset expanded to three codon pairs, respectively, by route dualities.

4


https://doi.org/10.1101/017988

Details are as follows. The codon pair #2 in the initial subset expanded to the three continual codon pairs #7,#8 and #9 by route duality

#2 − Ala − #8 ∼ #7 − Ala − #9.

The codon pair #1 in the initial subset expanded to the three continual codon pairs #10, #11 and #12 by routeduality

#1 − Ala − #11 ∼ #10 − Ala − #12.


#3 − Ala − #14 ∼ #13 − Ala − #15.


#6 − Ala − #17 ∼ #16 − Ala − #18.

The codon pair #5 in the initial subset expanded to the three codon pairs #19, #24 and #26 by route duality

#5 − Ala − #26 ∼ #19 − Ala − #24.

The codon pair #4 in the initial subset expanded to the three codon pairs #20, #25 and #27 by route duality

#4 − Ala − #27 ∼ #20 − Ala − #25.

According to the base substitutions in the roadmap, the recruitment order of the codon pairs from #1 to#32 is as follows (Fig 1A):

#1 GGG ·CCC, #2 GGC ·GCC, #3 GGA ·UCC, #4 GAG ·CUC, #5 GAC ·GUC, #6 GGU ·ACC,#7 GCG ·CGC, #8 AGC ·GCU, #9 GCA ·UGC, #10 CGG ·CCG, #11 AGG ·CCU, #12 UGG ·

CCA, #13 CGA · UCG, #14 AGA · UCU, #15 UGA · UCA, #16 ACG · CGU, #17 AGU · ACU,#18 ACA ·UGU, #19 GUG ·CAC, #20 CAG ·CUG, #21 GAU ·AUC, #22 AUG ·CAU, #23 GAA ·

UUC, #24 GUA · UAC, #25 UAG ·CUA, #26 AAC ·GUU, #27 AAG ·CUU, #28 CAA · UUG,#29 AUA · UAU, #30 AAU · AUU, #31 UAA · AUU, #32 AAA · UUU;

and the recruitment order of the amino acids from No.1 to No.20 is as follows (Fig 1A):

No.1 Gly, No.2 Ala, No.3 Glu, No.4 Asp, No.5 Val, No.6 Pro, No.7 S er, No.8 Leu, No.9 Thr,No.10 Arg, No.11 Cys, No.12 Trp, No.13 His, No.14 Gln, No.15 Ile, No.16 Met, No.17 Phe,No.18 Tyr, No.19 Asn, No.20 Lys.

The recruitment order of the codon pairs and the recruitment order of the amino acids are intricatelywell organised and coherent, according to the subtle roadmap (Fig 1A, 2B). In the initiation stage, firstly,

5


https://doi.org/10.1101/017988

the amino acid No.1 was recruited with the codon pair #1, remaining a vacant position. Subsequently, No.1and No.2 were recruited with the codon pair #2; No.1 was recruited with the codon pair #3, remaining avacant position; No.3 was recruited with the codon pair #4, remaining a vacant position; No.4 and No.5 wererecruited with the codon pair #5; No.6 filled up the vacant position of #1; No.7 filled up the vacant positionof #3; No.8 filled up the vacant position of #4; No.1 and No.9 were recruited with the codon pair #6 (Fig 2B).Thus the framework of the genetic code had been established at the end of the initiation stage. Hereafter, itwas always proper to recruit two amino acids that had been already recruited or were about to be recruitednext, such that no more positions remained vacant. Details are as follows. No.2 and No.10 amino acids wererecruited with the codon pair #7; and subsequently, No.2 and No.7 were recruited with #8; No.2 and No.11were recruited with #9; No.6 and No.10 were recruited with #10; No.6 and No.10 were recruited with #11;No.6 and No.12 were recruited with #12; No.7 and No.10 were recruited with #13; No.7 and No.10 wererecruited with #14; No.7 and stop were recruited with #15; No.9 and No.10 were recruited with #16; No.7and No.9 were recruited with #17; No.9 and No.11 were recruited with #18; No.5 and No.13 were recruitedwith #19; No.8 and No.14 were recruited with #20; No.4 and No.15 were recruited with #21; No.13 andNo.16 were recruited with #22; No.3 and No.17 were recruited with #23; No.5 and No.18 were recruitedwith #24; No.8 and stop were recruited with #25; No.5 and No.19 were recruited with #26; No.8 and No.20were recruited with #27; No.8 and No.14 were recruited with #28; No.15 and No.18 were recruited with #29;No.15 and No.19 were recruited with #30; No.8 and stop were recruited with #31; No.17 and No.20 wererecruited with #32 (Fig 2B).

Take for example from #1 to #29, the evolution of the genetic code along the roadmap can described indetails as follows (Fig 1A, 1B). Starting from the position #1 (Fig 1B #1), an R single-stranded DNA broughtabout a YR double-stranded DNA; next, the YR double-stranded DNA brought about a YR∗R1 triple-strandedDNA (the number 1 denotes #1, similar below); next, an R1 single-stranded DNA departed from the YR ∗ R1triple-stranded DNA; next, the R1 single-stranded DNA brought about a R1Y1 double-stranded DNA. Thus,the codon pair GGG ·CCC were achieved at #1. At the beginning of #7 (Fig 1B #7), the R1Y1 double-strandedDNA was renamed as Y1R1 double-stranded DNA, where the 180◦ rotation did not change the right-handedhelix; next, the Y1R1 double-stranded DNA brought about a Y1R1 ∗ R7 triple-stranded DNA, through thetransversion from G to C, where the stability (+) of CG ∗G increased to the stability (4+) of CG ∗C; next, anR7 single-stranded DNA departed from the Y1R1∗R7 triple-stranded DNA; next, the R7 single-stranded DNAbrought about a R7Y7 double-stranded DNA. Thus, the codon pair GCG ·CGC were achieved at #7. The caseof #19 is similar to #7 (Fig 1B #19); the codon pair GTG ·CAC were achieved through the transition from C

to T , where the stability (+) of GC∗C increased to the stability (2+) of GC∗T . The case of #24 is also similarto #7 (Fig 1B #24); the codon pair GT A · T AC were achieved through the common transition from G to A,where the stability (+) of CG ∗G increased to the stability (2+) of CG ∗ A. At the position #29 (Fig 1B #29),the codon pair GCG ·CGC in Y24R24 are non-palindromic in consideration that both GCG and CGC do notread the same backwards as forwards. In this case, a reverse operation is necessary so that the obtained codonpair CAT · ATG in y24r24 read reversely the same as the codon pair T AC · GT A in Y24R24. The processfrom y24r24 to R29Y29 is still similar to the case of #7; the codon pair AT A · T AT were achieved through

6


https://doi.org/10.1101/017988

the transition from G to A, where the stability (+) of CG ∗G increased to the stability (2+) of CG ∗ A. Otherprocesses in the roadmap are similar to the above example (Fig 1A, 1B). The reverse operation is unnecessaryfor the cases #2, #7, #10, #11, #3, #4, #16, #9, #19, #27, #23, #22, #24 after palindromic codon pairs andthe last one #32 (Fig 1A), whereas the reverse operation is necessary for the remaining cases #5, #6, #8, #12,#13, #14, #15, #17, #18, #20, #21, #25, #26, #28, #29, #30, #31 (Fig 1A).

1.3 The ending of the roadmap

Nowadays the genetic code table had been expanded from the 6 codon pairs in the initial subset to the 6 + 18codon pairs by route duality; the remaining 8 codon pairs were recruited into the genetic code table in theending stage of the roadmap (Fig 1A, 2B). There were 2 codon pairs remained in each of the four routesRoute 0 − 3 respectively. They satisfied pair connections as follows: #23 − Phe − #32, #21 − Ile − #30,#22 − Met/Ile − #29, #28 − Leu − #31 (Fig 2B). Two of them satisfied route duality (Fig 2B):

#21 − Ile − #30 ∼ #22 − Met/Ile − #29.

The last two stop codons appeared in the pair connection #25 − stop − #31 (Fig 1A, 2B). When the last twoamino acids were recruited through the base pairs #26 − Asn − #30 and #27 − Lys − #32, the codon UAG at#25 had to be selected as a stop codon. Due to no participation of corresponding tRNA, the codon UAA at#31 was selected as the last stop codon.

The non-standard codons also satisfy codon pairs and route dualities in the roadmap (Fig 1A). The codonpairs pertaining to non-standard codons are as follows: #11 − Arg (S er, stop) − #14, #4 − Leu (Thr) − #27in Route 0; none in Route 1; #22 − (Met) − #29 in Route 2; #20 − Leu (Thr,Gln) − #25, #12 − (Trp) − #15,#25− stop (Gln)/Leu−#31, #28−Leu (Gln)−#31 in Route 3. Majority of non-standard codons appear in thelast Route 3 (Fig 1A). Route dualities of non-standard codons exist between Route 0 and Route 3 (Fig 1A):

#4 − Leu (Thr) − #27 ∼ #20 − Leu (Thr) − #25

#11 − (stop) − #14 ∼ #12 − Trp/stop − #15,

where the first stop codon UGA at #15 is dual to the non-standard stop codons in Route 0.

The stabilities of base triplexes played crucial roles in the evolution of the genetic code. It had beenemphasised that the roadmap followed the strict rule that the stabilities of base triplexes increase gradually(Fig Sla1). Also note that the roadmap had tried its best to avoid the unstable triplex DNA. Among the 16possible base triplexes, there are three relatively unstable base triplexes: GC ∗ A, AT ∗ C and AT ∗ A. Sothe statistical ratio of instability for the base triplexes is 3/16. However, the ratio of instability for the basetriplexes in the roadmap is much smaller. There are 49 triplex DNAs through #1 to #32 in the roadmap,which involve 3 × 49 = 147 base triplexes (Fig 1A). The relatively unstable base triplexes GC ∗ A andAT ∗ C have not appeared in the roadmap; only the relatively unstable base triplex AT ∗ A has appearedinevitably for 7 times in the reverse operations so as to fulfil all the permutations of 64 codons (Fig 1A). The

7


https://doi.org/10.1101/017988

ratio of instability 7/147 in the roadmap is much smaller than the ratio of instability 3/16 by the statisticalrequirement. When the relatively unstable AT ∗ A appears at the positions #15, #17, #21, #25, #29, #30 and#31, both stabilities of the other two base triplexes in the triplex DNA are (4+) (Fig 1A), which compensatesthe instability of the triplex DNA to some extent. The amino acid Ile, whose degeneracy uniquely is three,occupied three positions #21, #29 and #30 among those 7 positions. And the three stop codons occupiedother three neighbour positions #15, #25 and #31 (Fig 1A). The route dualities played significant roles inthe expansion stage (Fig 1A, 2B). The hence remnant codons were chosen as the stop codons. The first stopcodon UGA appeared at the position #15 where the relatively unstable AT ∗ A appeared firstly (Fig 1A),hence it is possible that the instability might be helpful to achieve the function of stop codons. The stopcodon appeared as early as the midway of the evolution of the genetic code (Fig 1A, 2B), which indicatesthat the genetic code had been taken shape around the midway to promote the formation of the primitive life.Not until the fulfilment of the genetic code, did the translation efficiency increase notably by recognising allthe 64 codons.

1.4 Origin of tRNAs

The roadmap illustrates the coevolution of the genetic code and the amino acids, in which tRNAs playedmediate roles. The recruitment and expansion of tRNAs underlay pair connections and route dualities. Thissubsection explains the origin of tRNAs according to the roadmap.

There are 16 branch nodes and 16 leaf nodes in the roadmap (Fig 1A, 2A). The leaf nodes are as follows:#3, #14, #27, #32 in Route 0; #8, #17, #26, #30 in Route 1; #9, #18, #22, #29 in Route 2; #13, #15, #28, #31in Route 3. And the branch nodes are as follows: #1, #4, #11, #23 in Route 0; #2, #5, #6, #21 in Route 1;#7, #16, #19, #24 in Route 2; #10, #12, #20, #25 in Route 3. For each branch node, the third strand of thetriplex nucleic acids is of DNA (Fig 1A). Nevertheless the third strand is an RNA strand (Fig 2A). Namelythe first two DNA strands are combined by a third RNA strand. Such triplex nucleic acids are of two types:the triplex nucleic acid YR ∗ Yt and the triplex nucleic acid YR ∗ Rt (Fig 2A), where the subscript t denotestRNA. After departing from the triplex nucleic acids, the combination of the two complementary single RNAstrands Yt and Rt fold into a cloverleaf-shaped tRNA [5] [6] [7] [8] [9], which acquired a certain anticodoncorresponding to its position in the roadmap (Fig S2a3).

For example, the tRNA t1 can form from the two triplex nucleic acids Y1R1 ∗ Yt1 and Y1R1 ∗ Rt1 in theposition #1(Fig 2A #1). The triple code CCC in the strand Yt1 is palindromic, while the triple code GGG in thestrand Rt1 is palindromic. The two complementary strands Yt1 and Rt1 can combine into a cloverleaf-shapedtRNA t1 by joining together, complementary pairing and folding (Fig S2a3). Thus anticodon arm of t1contains the anticodon CCC, which corresponds to Gly; namely t1 transports Gly. As another example, thecodon in the position #11 is not palindromic, where two tRNAs t7 and t10 are folded respectively (Fig 2A#11). Two complementary single RNA strands Yt11 and Rt11 form by reverse cooperation. They combineinto the tRNA t7 by joining together, complementary pairing and folding, which contains the anticodon GGA

8


https://doi.org/10.1101/017988

and carry S er (Fig S2a3). Furthermore, another two complementary single RNA strands yt11 and rt11 canform. They combine into the tRNA t10, which contains the anticodon CCU and carry Arg (Fig S2a3).

There are 4 palindromic codons: #1 CCC ·GGG, #4 CUC ·GAG, #7 CGC ·GCG, #19 CAC ·GUG amongthe 16 branch nodes in the roadmap (Fig S2a3). Accordingly there are 12 non-palindromic codons among thebranch nodes in the positions #2, #5, #6, #10, #11, #12, #16, #20, #21, #23, #24 and #25. Due to the bijectionbetween Route 1 and Route 3 in the sense of reverse relationship (Fig 1A, 2A), the complementary single RNAstrands obtained in the two routes are same to each other (Fig 2A). Thus, there are totally 4+ (12−4)×2 = 20pairs (4 palindromic codons, and the 12 non-palindromic codons minus 4 identities between Route 1 andRoute 3) of complementary single RNA strands, which combine into 20 tRNAs respectively (Fig S2a1, S2a3).

The anticodons of tRNAs such obtained may originate from either the triplet code in Yt or the tripletcode in Rt of the 20 possible pairs of triplet codes in the branch nodes (Fig S2a3). An assignment schemecan be obtained so that 20 anticodons chosen from the 20 pairs of possible triplet codes can be assigned tothe 20 canonical amino acids respectively (Fig S2a2), which is as follows in details (Fig 2A, S2a2, S2a3).Note that these tRNAs are named after the series number of the recruitment order of the amino acids in theroadmap. There are 6 tRNAs (t1:1Gly, t3:3Glu, t7:7S er, t10:10Arg, t17:17Phe, t20:20Lys) from Route 0, 2tRNAs (t4:4Asp, t19:19Asn) from Route 1, 6 tRNAs (t2:2Ala, t9:9Thr, t11:11Cys, t13:13His, t16:16Met,t18:18Tyr) from Route 2, and 6 tRNAs (t5:5Val, t6:6Pro, t8:8Leu, t12:12Trp, t14:14Gln, t15:15Ile) fromRoute 3. The anticodons of t4 and t5 originated respectively from yt5 and Rt20 of the same pairs yt5rt5 orYt20Rt20 (Fig 2A). The tRNA t19 originated from the combination of Yt30 and Rt30, whose anticodon AUU

came from a transition from C to U (Fig 2A, S2a2).

There are 20 canonical amino acids encoded by the degenerate codons. The above assignment scheme hasprovided an argument to account for the number of the canonical amino acids (Fig 2A, S2a2, S2a3). It willbe explained, in Subsection 2.3, that not only the 64 codons result from the 64 permutations of the 4 bases,but also the 20 canonical amino acids relate to the 20 combinations of the 4 bases (Fig S3d1). The genomiccodon distributions can be divided into 20 classes according to the 20 combinations of the 4 bases (Fig S3d1).Given the base compositions p(i), i = G, C, A, T , the productsp(i)∗ p( j)∗ p(k), i, j, k = G, C, A, T are equalfor each of the 20 combinations. The 20 combinations of the 4 bases can be divided into 4 groups < G >,< C >, < A >, < T > (Fig S3d1). Hierarchy 1 and Hierarchy 2 corresponds < G > and < C >; Hierarchy 3and Hierarchy 4 corresponds to < A > and < T >. Their positions in the roadmap are

Hierarchy 1 − 2 Y : < G > , Hierarchy 1 − 2 R : < C >,

Hierarchy 3 − 4 Y : < A > , Hierarchy 3 − 4 R : < T > .

Route 0 and Route 1 − 3 are distinguished in each of the four groups. Take < G > for example, < G, G, G >

an < G, G, A > belong to Route 0, while < G, G, C >, < G, G, T > and < G, C, A > belong to Route 1−3.The 20 tRNAs transfer 20 amino acids, the number of which is equal to the number of the 20 classes ofgenomic codon distributions. This is helpful to improve the translation efficiency, because it is easier for atRNA to recognise a certain class of genomic codon distributions with the same codon interval distance.

9


https://doi.org/10.1101/017988

The wobble pairing rules can be explained by the theory on the origin of tRNAs. The transition from C

to T occurred in the position #6, which resulted in the wobble pairing rule G : U or C (Fig 2A). Taking y2r2as a template, yt2 with GCC is formed by the triplex base paring, while rt2 with GGC and r′t 2 with GGU areformed, where the transition from C to U occurred in the formation of r′t 2. The complementary strands yt2and r′t 2 combine into a tRNA with anticodon GCC, where G at the first position of the anticodon of the tRNAis paired with U at the third position of the triple code of an additional single strand r′t 2. It implies that thewobble pairing rule G : U had been established as early as the end of the initiation stage of the roadmap. Thetransition from C to T occurred in the position #12, which resulted in the wobble pairing rule U : G or A (Fig2A). Taking y10r10 as a template, yt10 with CCG is formed by the triplex base paring, and rt10 with CGG

and r′t 10 with UGG are also formed, where the transition from C to U occurred in the formation of r′t 10.The complementary strands yt10 and r′t 10 combine into a tRNA with anticodon UGG, where U at the firstposition of the anticodon of the tRNA is paired with G at the third position of the triple code of an additionalsingle strand yt10. The above explanation of the wobble pairing rules by tRNA mutations is supported bythe observations of nonsense suppressor. For instance, the wobble pairing rule C : A for a UGA suppressorcan be established by a transition from G to A at the 24 − th position of tRNATrp. The wobble pairing rulesG : U or C and U : G or A had been established early in the evolution of the genetic code (Fig 2A), whichcontinued to flourish so as to make full use of the tRNAs in short supply.

The 20 anticodons in the tRNAs from t1 to t20 correspond to the 20 amino acids, respectively. Hence,the corresponding 20 codons can be regarded as the minimum subset of the genetic code that encodes allthe amino acids (Fig S2a2). The minimum subset consists of GGG, GCG, GAG, GAC, GUG, CCG, UCC,CUA, ACG, AGG, UGC, UGG, CAC, CAG, AUC, AUG, UUC, UAC, AAU, AAG. For each amino acid, itscorresponding codon in the minimum subset generally recruited earlier than the other corresponding codonsout of the minimum subset (Fig S2c2). The codons in the minimum subset whose first base are purinesgenerally correspond to the anticodons that come from the strands Yt, except for GUC and AUC (Fig S2a2);those whose first base are pyrimidines generally correspond to anticodons that come from the strands Rt,except for CAG and UGG (Fig S2a2). The codons in the minimum subset are generally situated in the branchnodes, except for AAG, UCC, AAU, AUG and UGC (Fig 2A, S2a2). The numbers of codons in the minimumsubset in Route 0−3 are 6, 4, 6, 4 respectively (Fig 2A, S2a2). The minimum subset can be extended throughwobble pairings. Additional tRNAs can be combined according to the roadmap so as to recognise all thecodons.

1.5 Codon degeneracy

The four routes of the roadmap (Fig 1A) can be represented by four cubes (Fig 2C). The vertices of the cubesrepresent codon pairs, while the edges of the cubes represent substitution relationships (Fig 2C). Cubes areconvenient to demonstrate pair connections and route dualities, by which it is more intuitive to study thecodon degeneracy. The degeneracies 6, 4, 3, 2 or 1 for the 20 amino acids can be explained one by oneaccording to pair connections and route dualities in the roadmap (Fig 1A, 2C). The degeneracy 2 mainly

10


https://doi.org/10.1101/017988

results from pair connections. The degeneracy 4 or 6 mainly result from the expansion of the genetic codefrom the initial subset by route dualities pertaining to S er, Leu, Ala, Val, Pro and Thr (Fig 2B, 2C).

The degeneracy 6 for S er, Leu and Arg can be explained by pair connections and route dualities (Fig 1A,2C), where S er and Leu belong to the initial subset and Arg was recruited immediately after the initial subset.And all of them have appeared in Route 0. The 6 codons of S er satisfy both the pair connection #8−S er−#17and the route duality

#3 − S er − #14 ∼ #13 − S er − #15.

The 6 codons of Leu satisfy both the pair connection #28 − Leu − #31 and the route duality

#4 − Leu − #27 ∼ #20 − Leu − #25.

The 6 codons of Arg satisfy both the pair connection #7 − Arg − #16 and the route duality

#11 − Arg − #14 ∼ #10 − Arg − #13.

The degeneracy 4 for Gly, Ala, Val, Pro and Thr can be explained by route dualities (Fig 1A, 2C). All ofthem belong to the initial subset. The degeneracy 4 for Gly satisfy the route duality:

#1 −Gly − #3 ∼ #2 −Gly − #6.

The degeneracy 4 for Ala satisfy the route duality:

#2 − Ala − #8 ∼ #7 − Ala − #9.

The degeneracy 4 for Val satisfy the route duality:

#5 − Val − #26 ∼ #19 − Val − #24.

The degeneracy 4 for Pro satisfy the route duality:

#1 − Pro − #11 ∼ #10 − Pro − #12.

The degeneracy 4 for Thr satisfy the route duality:

#6 − Thr − #17 ∼ #16 − Thr − #18.

The degeneracy 2 for Glu, Asp, Cys, His, Gln, Phe, Tyr, Asn and Lys can be explained by pair connections(Fig 1A, 2C). They satisfy the following pair connections respectively: #4 − Glu − #23, #5 − Asp − #21,#9 − Cys − #18, #19 − His − #22, #20 − Gln − #28, #23 − Phe − #32, #24 − Tyr − #29, #26 − Asn − #30,#27 − Lys − #32. The degeneracy 3 for Ile and the degeneracy 1 for Met satisfies the route duality (Fig 1A,2C)

#21 − Ile − #30 ∼ #22 − Met/Ile − #29.

11


https://doi.org/10.1101/017988

The degeneracy 1 for Trp satisfies the pair connection for nonstandard genetic code #12 − Trp/stop(Trp) −#15. This pair connection includes a stop codon; the other stop codons satisfy the pair connection: #25 −stop − #31 (Fig 1A, 2C).

It is convenient to illustrate the symmetry of the roadmap by the cubes (Fig 2C). Some correspondencesof codon pairs among routes are as follows (Fig 2C):

Route 0 : UVW · wvu ↔ Route 1 : UVw ·Wvu

Route 0 : WUV · vuw ↔ Route 3 : wUV · vuW

Route 1 : WUV · vuw ↔ Route 2 : vuW · wUV

Route 2 : UVW · wvu ↔ Route 3 : VUW · wuv.

The feature of pair connections in the roadmap can be explained by the wobble pairings. The pair connectionsin the routes: Route 0 R strand, Route 2 R strand and Route 3 can be explained by the wobble pairing ruleU : G, A (Fig 2C); The pair connections in the routes: Route 0 Y strand, Route 2 Y strand and Route 1can be explained by the wobble pairing rule G : U,C (Fig 2C). By and large, there is a symmetry betweenRoute 0, 1 and Route 3, 2 in the roadmap (Fig 1A, 2C). Considering the expansion of the genetic code fromthe initial subset, the pair connections pertaining to Pro, S er and Leu in Route 0 Y strand are dual to the pairconnections in Route 3 Y strand, respectively (Fig 2B, 2C); the pair connections pertaining to Ala, Thr andVal in Route 1 Y strand are dual to the pair connections in Route 2 R strand, respectively (Fig 2B, 2C). Thecodon pairs can be divided into groups by the 6 facets of the cubes for the routes respectively: Facet u andFacet d by up and down; Facet w and Facet e by west (left) and east (right); Facet n and Facet s by north(back) and south (front) (Fig 2C). According to the route dualities, Route 0 and Facet u are dual to Route 3and Facet s; Route 0 and Facet d are dual to Route 3 and Facet n; Route 1 and Facet s are dual to Route 2and Facet u; Route 1 and Facet n are dual to Route 2 and Facet d (Fig 2C).

The roadmap not only explain the codon degeneracy for the 20 amino acids, but also discriminate the 5biosynthetic families of the amino acids. A GCAU genetic code table can be obtained by changing the orderU, C, A, G in the standard genetic code table to a new order G, C, A, U, where the third base are sorted by theorder G, A, C, U in considering the wobble pairings (Fig S2c2). The biosynthetic families of Glu, Asp, Val,S er and Phe situate in the continuous regions in the GCAU genetic code table, respectively (Fig S2c2). Thecodons as well as amino acids, in the recruitment orders, expand roughly from up to down, from left to rightin the GCAU genetic code table (Fig S2c2). So the evolution of the genetic code can be illustrated better inthe GCAU genetic code table than in the standard genetic code table. The clusters of the biosynthetic familiescan be explained by the roadmap. The codons are classified by the R strands and Y strands of the four routesas follows:

Route 0 R : RRR , Route 0 Y : YYY,

Route 1 R : RRY , Route 1 Y : RYY,

Route 2 R : RYR , Route 2 Y : YRY,

Route 3 R : YRR , Route 3 Y : YYR.

12


https://doi.org/10.1101/017988

In such a classification, the GCAU genetic code table can be divided into 8 × 4 blocks (Fig S2c1), whichconsequently influences the codon degeneracy and the clustering of biosynthetic families (Fig S2c2). Theregion of Route 0 R GRR and Route 1 GNY (N denotes purine or pyrimidine) relate to the biosyntheticrelationships among Gly, Ala, Glu, Asp, Val (Fig S2c1, S2c2). The region of Route 3 CNR explains thealignment of Arg, Pro, Gln that belong to the biosynthetic family of Glu (Fig S2c1, S2c2). The region ofRoute 1 ANY explains the alignment of Thr, Asn, Ile that belong to the biosynthetic family of Asp (Fig S2c1,S2c2). The region of Route 1 RNY relates to the biosynthetic relationships among Asp and Thr, Asn (FigS2c1, S2c2). The region of Route 3 R URR explains the correlations of positions among the 3 stop codons(Fig S2c1, S2c2). Moreover, the region of Route 0 R RRR can explain the biosynthetic relationships fromGlu to Lys; and ARG was assigned to Arg of the biosynthetic family of Glu (Fig S2c1, S2c2). The regionof Route 2 R GYR explains the biosynthetic relationships from Ala to Val (Fig S2c1, S2c2). The region ofRoute 3 Y YUR explains the degeneracy of Leu (Fig S2c1, S2c2).

The roadmap provides a framework beyond the explanation of the evolution of the genetic code. Accordingto the roadmap, the genetic code, tRNA, amino acid and aminoacyl-tRNA synthetase evolved together.Except the certain three positions of the codon, other positions also evolve along the roadmap (Fig S3a1).The evolution of the triplex nucleic acids along the roadmap might generate the corresponding sequenceencoding the aminoacyl-tRNA synthetase. The classification of the aminoacyl-tRNA synthetases distributedregularly in the roadmap. The aminoacyl-tRNA synthetases are divided into two classes Class II and Class I,and accordingly subclasses IIA, IIB, IIC; IA, IB, IC [10]. According to the recruitment order of theamino acids, the Class II aminoacyl-tRNA synthetases originated earlier than the Class I aminoacyl-tRNAsynthetases (Fig 2A, 2C). In the initial subset, #1, #2, #3, #5 Y , #6 are Class II aminoacyl-tRNA synthetases;Gly − tRNA appeared at #1 as the first Class II aminoacyl-tRNA synthetase (Fig 2C). #4, #5 R are Class I

aminoacyl-tRNA synthetases; Glu−tRNA appeared at #4 as the first Class I aminoacyl-tRNA synthetase (Fig2C). Majority of the Route 0, 1 are Class II aminoacyl-tRNA synthetases; majority of the Route 2, 3 areClass I aminoacyl-tRNA synthetases (Fig 2C). The exceptions mostly concern with the route duality betweenRoute 0 and Route 3 pertaining to S er, Leu, Pro, and route duality between Route 1 and Route 2 pertainingto Ala, Val, Thr (Fig 2C). The early Class II aminoacyl-tRNA synthetases recognize the variable arm of thetRNA. According to the explanation of the origin of the tRNA, the variable arm is complementary with theanticodon arm (Fig S2a3). Paracodons may be around the variable arm. The anticodons of the degeneracies4 or 6 are GGN, GCN, GUN, CGN, CCN, CUN, ACN and UCN. The aminoacyl-tRNA synthetases inthese codon boxes are IIA or IA (Fig S2c3). The aminoacyl-tRNA synthetases with anticodons of the smalldegeneracies are IIB, IIC, IB, IC (Fig S2c3). The aminoacyl-tRNA synthetases can also be divided intomajor groove (M) and minor groove (m) [11]. They distribute regularly in the roadmap (Fig 2C):

Route 0 Facet u : M , Route 0 Facet d : m;

Route 1 Facet s : M , Route 1 Facet n(Y/R) : M/m;

Route 2 Facet u(Y/R) : M/m , Route 2 Facet d(Y/R) : m/M;

Route 3 Facet s(Y/R) : m/M , Route 3 Facet n : m.

13


https://doi.org/10.1101/017988

The aminoacyl-tRNA synthetasis and the tRNA determines the pair connections and route dualities in theroadmap, which resulted in the codon degeneracy for amino acids and the 5 biosynthetic families of aminoacids (Fig 2C).

1.6 Origin of homochirality of life

Homochirality is an essential feature of life. In Part I of this paper, the origin of the homochirality of life isexplained by chiral symmetry breaking between the left-handed roadmap and the right-handed roadmap (Fig2D), which helps to explain the origin of the living system from the non-living system. This is analogous to(but not trivially related to) the spontaneous symmetry breaking in particle physics [12] [13], which helps toexplain the generation of masses from massless fields. The time for the existence of life on the Earth is in thesame order of magnitude of the age of the universe, which implies that life is an intrinsic phenomenon in theuniverse that originates spontaneously at the macromolecular level. Homochirality safeguards the stability ofthe living system, which has ensured the life to last comparable with the age of the universe.

The first recruited amino acid Gly is not chiral (Fig 1A #1). The following 19 amino acids are chiral (Fig1A #2). In the initiation stage of the roadmap, the nonchiral Gly helped to create the first pair connection#1 −Gly − #2, recruiting chiral Ala at #2 (Fig 1A). And the nonchiral Gly also helped to create the first routeduality in the roadmap (Fig 1A):

#1 −Gly − #3 ∼ #2 −Gly − #6.

This route duality played a central role in the initiation stage; consequently the initial subset played a centralrole in the expansion stage (Fig 2B). The chirality was required at the beginning of the roadmap by thetriplex DNA itself (Fig 1A, 1B). Even so, there was still a transition period from nonchirality to chirality, inconsideration of the special role of nonchiral Gly.

The complex evolution from Hierarchy 1 to Hierarchy 4 was a recurrent and cyclic process, with theintermittent chemical reaction conditions in environment. There were two possible chiral roadmaps in theevolution of the genetic code as follows (Fig 2D):

Right-handed chiral roadmap

from #1 to #32, right-handed triplex DNA, D-ribose, L-amino acids;

Left-handed chiral roadmap

from 1# to 32#, left-handed triplex DNA, L-ribose, D-amino acids.

The four bases themselves are not chiral, which are shared in common by the two chiral roadmaps. The twochiralities were selected randomly in the winner-take-all competition. The translation efficiency improvedgreatly as soon as all the 64 codons were recruited into the roadmap. First completed chiral roadmapdominated and was able to decompose the uncompleted opposite chiral system. Hence, there is no longerthe chance for the existence of the opposite chirality.

14


https://doi.org/10.1101/017988

The origin of the homochirality of life can be simulated based on the roadmap. The competition betweenthe right-handed chiral roadmap and the left-handed chiral roadmap can be simulated by a stochastic model.The selection of chirality of life was absolutely by chance with equal probability. The chiral roadmap evolvesfrom #1 to #32 or from 1# to 32# by a constance substitution probability step by step (Fig S2d1). Thestruggle for shared resources, such as the bases, might existed between the two chiral roadmaps (Fig S2d1).The model can estimate how many steps the winner pulled away from the loser in the competition of thetwo chiral roadmaps. The results of the estimations are as follows: around 3 steps gap in the case withoutstruggle for bases (Fig S2d1), and around 7 steps gap in the case with struggle for bases (Fig S2d1). In eithercase, the chiral roadmap of the loser had already finished the initiation stage and expansion stage according tothe simulations. Namely, it was in the ending stage that the two chiral roadmaps were separated (Fig S2d1).The recruitment of the last two stop codons in the ending stage may promote the separation of the two chiralroadmaps (Fig 1A, S2d1).

It is a great expectation to achieve the homochirality of life based on the roadmap for the experimentalists.Artificial conditions are more operational and efficient than the environmental changes. Proper proceduresand schedules are expected to considerably shorten the time to achieve the homochirality of life in laboratory.This is indeed a challenge.

1.7 Evidence for the roadmap

So far, a detailed description of the evolution of the genetic code has been proposed according to the roadmaptheory. In this subsection, numerous theoretical and experimental evidences are listed concretely that will beexplained here or later on.

The key experimental evidences in supporting of the roadmap are as follows. The roadmap itself is basedon the experimental results on the relative stabilities of the base triplexes. The delicate roadmap has narrowlyavoided the unstable base triplexes. The layout of the roadmap is, therefore, determined by the increasingstabilities of the base triplex along the roadmap (Fig Sla1). Genomic data provide the most successful andmost straightforward experimental evidence for the roadmap, which will be explained in detail in Section2. An evolutionary tree of codons is obtained based on the genomic codon distributions, where the fourhierarchies of the roadmap are distinguished clearly (Fig 3A). This is a straightforward evidence to supportthe roadmap theory. Coincidentally, an evolutionary tree of species can also be obtained based on the samegenomic codon distributions and the roadmap (Fig S3d8, S3d9), which agrees with the tree of three domains.The successful explanation of the origin of the three domains also conversely supports the roadmap theory.Another strong and heuristic evidence for the roadmap is that the origin of the homochirality of life canbe explained by the roadmap straightforwardly (Fig 2D). If the homochirality of life can be achieved bythe roadmap in experiment, the roadmap will be conclusively confirmed. The origin of the genetic code,the origin of the homochirality of life and the origin of the three domains can be explained together in theroadmap picture (Fig 1, 2, 3). These three theories support one another.

15


https://doi.org/10.1101/017988

Furthermore, the roadmap is supported by the following evidences. The roadmap predict the recruitmentorder of the 64 codons, the recruitment order of the 3 stop codons and the recruitment order of the 20 aminoacids. Evaluation of the three recruitment orders may verify the roadmap theory.

The recruitment order of the 32 codon pairs can be obtained from #1 to #32 by the roadmap (Fig 2B,S2b1). The declining GC content indicates the evolution direction because of the substitutions from G to A,from G to C and from C to T in the roadmap (Fig S2b1). According to the roadmap, the total GC content andthe position specific GC content for the 1st, 2nd and 3rd codon positions are calculated for each step from #1to #32. Then the relationship between the total GC content and the position specific GC content is obtained(Fig S2b2). The 1st position GC content is higher than the 2nd position GC content. And the 3rd position GC

content declined rapidly from the highest to the lowest in the evolution direction when the total GC contentdeclines. The GC content variation in the simulation agrees with the observation (to compare Fig S2b2 withFigure 2 in [14] and Figure 5 in [15]), so the recruitment order obtained by the roadmap is reasonable. Therecruitment order of the codons by the roadmap generally agree with the order in [4] [16] [17]. It shouldbe noted that the roadmap itself was conceived firstly by studying the substitution relationships based on therecruitment order in [4], which was furthermore improved in accordance with the recruitment order of aminoacids.

The recruitment order for the three stop codons are #15 UGA, #25 UAG, #31 UAA (Fig 1A, 2B), whichresults in the different variations of the stop codon usages. Along the evolution direction as the declining GC

content, the usage of the first stop codon UGA decreases; the usage of the second stop codon UAG remainsalmost constantly; the usage of the third stop codon UAA increases (Figure 1 in [18]). The observations ofthe variations of stop codon usages can be simulated (to compare Fig S2b3 with Figure 1 in [18]) accordingto the recruitment order of the codon pairs (Fig 2B, S2b1) and the variation range of the stop codon usages.Especially, the detailed features in observation can be simulated that the usage of UGA jump downwardsgreatly; UAA, upwards greatly, around half GC content (to compare Fig S2b3 with Figure 1 in [18]).

The recruitment order of the 20 amino acids from No.1 to No.20 can be obtained by the roadmap (Fig 2B,S2b1), which meets the basic requirement that Phase I amino acids appeared earlier than the Phase II aminoacids [19] [20]. The species with complete genome sequences are sorted by the order R10/10 according to theiramino acid frequencies, where the order R10/10 is defined as the ratio of the average amino acid frequenciesfor the last 10 amino acids to that for the first 10 amino acids [21]. Along the evolutionary direction indicatedby the increasing R10/10, the amino acid frequencies vary in different monotonous manners for the 20 aminoacids respectively (Fig S2b4). For the early amino acids Gly, Ala, Asp, Val, Pro, the amino acid frequenciestend to decrease greatly, except for Glu to increase slightly (Fig S2b4); for the midterm amino acids S er, Leu,Thr, Cys, Trp, His, Gln, the amino acid frequencies tend to vary slightly, except for Arg to decrease greatly(Fig S2b4); for the late amino acids Ile, Phe, Tyr, Asn, Lys, the amino acid frequencies tend to increasegreatly, except for Met to increase slightly (Fig S2b4). In the recruitment order from No.1 to No.20, thevariation trends of the amino acid frequencies increase in general (Fig S2b5); namely, the later the amino acidsrecruited, the more greatly the amino acid frequencies tend to increase (Fig S2b4, S2b5). The recruitment

16


https://doi.org/10.1101/017988

order of the amino acids from No.1 to No.20 is supported not only by the previous roadmap theory but alsoby this pattern of amino acid frequencies based on genomic data, so it help to resolve the disputes on such anissue in the literatures [22].

17


https://doi.org/10.1101/017988

2 Origin of the three domains

2.1 Genomic codon distributions reveal the relationship between the evolution of thegenetic code and the tree of life

In the previous section, the roadmap theory was proposed, as a hypothesis, to explain the evolution of thegenetic code and the origin of the homochirality of life (Fig 1, 2). In this section, more experimental evidencesare provided to enhance and reinforce the roadmap theory (Fig 3). Especially, the four hierarchies in theroadmap are directly distinguished in a tree of codons obtained by studying the genomic codon distributions(Fig 3A). And the origin of the three domains of life can be explained based on the roadmap and the genomicdata (Fig 3D). In short, the biodiversity originated in the evolution of the genetic code, and has flourished bythe environmental changes.

The viewpoint that the tree of life was rooted in the recruitment of codons can be proved by analysingthe complete genome sequences. The genomic codon distribution is a powerful method to study the statisticalproperties of the complete genome sequences, which can reveal the profound relationship between the primordialgenetic code evolution and the contemporary tree of life (Fig 3A, S3a5). The genomic codon distribution fora species with complete genome sequence is defined as follows (Fig S3a3): first, mark all the positions foreach of the 64 codons in its complete genome sequence respectively; second, calculate all the codon intervalsbetween the adjacent marked positions for each of the 64 codons respectively, and obtain the distributions ofcodon intervals according to the 64 sets of codon intervals respectively (some codon intervals count more, andthe others less), where a cutoff 1000 is used when counting the number of common intervals; third, obtain thegenomic codon distribution by normalising the 64 distributions of codon intervals. Normalisation eliminatesthe direct impact of a wide range variation of genome sizes.

Both the tree of codons (Fig 3A) and the tree of species (Fig S3a5) are obtained according to the same setof correlation coefficients among the genomic codon distributions. 952 complete genomes in GeneBank (upto 2009) were used in the calculations. There are 64 genomic codon distributions for each of the 952 species.On one hand, the average correlation coefficient matrix for codons is obtained by averaging these 64 × 64correlation coefficient matrices for the 952 species; hence the tree of codons is obtained based on the 64 × 64distance matrix that equals to half of one minus the average correlation coefficient matrix for codons (Fig3A); on the other hand, the average correlation coefficient matrix for species is obtained by averaging these952 × 952 correlation coefficient matrices for the 64 codons, hence the tree of species is obtained based onthe 952 × 952 distance matrix that equals to half of one minus the average correlation coefficient matrix forspecies (Fig S3a5). The codons from the four hierarchies Hierarchy 1−4 in the roadmap gather together in thetree of codons respectively, where the Hierarchy 1 and Hierarchy 2 situated in one branch, and Hierarchy 3and Hierarchy 4 in another branch (Fig 3A). This is a direct evidence to support the roadmap (Fig 1A, 3A).According to the NCBI taxonomy, the tree of species roughly presents the phylogeny of species (Fig S3a5).Especially, the tree of the eukaryotes with complete genomes is reasonable to some extent (Fig S3a6). In

18


https://doi.org/10.1101/017988

Subsection 2.3, an improved tree of species (Fig S3a8, S3a10) will be obtained by dimension reduction forthe 64 × 952 genomic codon distributions according to the roadmap (Fig S3d2, 3D).

The validity of the method of genomic codon distribution can be explained in the roadmap picture.Whereas the more concerns were given to the three-base-length segments of the triplex DNA in the previoussection, base substitutions can occur in any places through the whole length of the triplex DNA (Fig S3a1).Starting from the initial Poly C · poly G ∗ poly G, several codons in different regions of the triplex DNA canevolve independently along the roadmap (Fig S3a1). In this way, the triplex DNA evolves gradually fromthe initial identical sequences to the final non-identical sequences (Fig S3a1). The identical sequences tendto form a triplex DNA, but it becomes more and more difficult for the non-identical sequences to form atriplex DNA. So there was a transition from triplex DNA to double DNA after the genetic code evolution.The evolution of the whole sequence may result in varieties of distributions of codons in the genomes (e.g.,T AA in the last strand in Fig S3a1). The genomic codon distributions, therefore, resulted from the evolutionof the genome sequences. Thus, the biodiversity at the species level actually originated at the sequencelevel. The genomic codon distributions keep the essential features of species at the sequence level. Thegenomic codon distributions were reluctant to change, in the context of normalisation, during the evolutionof genome sizes mainly promoted by large scale duplications (Fig S3a4). In the long history of evolution,even if genome sizes have been doubled for many times, the essential features in the ancestors’ sequencescan be inherited generation by generation due to the relative invariant of the genomic codon distributions. Asa result, the ancient information on the evolution of the genetic code and the origin of the three domains canbe dug out from the complete genome sequences of contemporary species. It should be emphasised that thesequences evolved based on the roadmap were by no mean random, which may account for the linguistics inthe biological sequences.

According to the roadmap, the initial subset took shape in the initiation stage of the evolution of thegenetic code (Fig 1A), where the GC content of the initial subset was 13/18; leaf nodes appeared in theending stage (Fig 1A), where the GC content of the leaf nodes decreased to 16/48 (Fig S3a1). The variationrange of GC contents from 16/48 to 13/18 obtained by the roadmap generally agrees with the observationsof GC contents for contemporary species (Fig S3a1, S3a2). Along with the evolution of the whole sequences(Fig S3a1), the shorter triplex DNAs only accommodate a proper subset of the genetic code; the longertriplex DNAs accommodate almost all the codons. Such a difference will be considered in Subsection 2.2to explain the different features of genomic codon distributions between Eukaryotes and Bacteria etc. Notethat Eukaryotic genomes mainly consists of non-coding DNA, the tree of eukaryotes based on their genomiccodon distributions is, therefore, mainly depended on the non-coding DNA (Fig S3a6, S3d10). This indicatesthat the non-coding DNA has played a substantial role in the evolution of eukaryotes.

19


https://doi.org/10.1101/017988

2.2 Features of genomic codon distributions: observations and simulations

The features of genomic codon distributions are different among Virus, Bacteria, Archaea and Eukarya (Fig3B, S3b2), which are also called the (3 + 1, including Virus) domains for convenience. The genomic codondistribution for a domain is obtained by averaging the genomic codon distributions of all the species inthis domain (only considering those with complete genome sequences, in practice) (Fig 3B). The featureof genomic codon distribution for a species is similar to that of the corresponding domain (Fig 3B, S3b2).The method based on the genomic codon distribution apply not only for cellular life but also for viruses.Explanation of the features of the genomic codon distributions (Fig 3B, S3b3) is helpful to understand theorigin of the three domains and to clarify the status of virus.

The features of genomic codon distributions for the domains can be described respectively from twoaspects: first, the periodicity and the amplitudes of the genomic codon distributions; second, the order ofthe average distribution heights for 64 codons. (i) The features of genomic codon distribution for Virusinclude declining three-base-periodic fluctuations, high amplitudes (Fig 3B, S3b2). In addition, the orderof the average distribution heights is: Hierarchy 1, highest; followed by Hierarchy 2 and Hierarchy 3;and Hierarchy 4, lowest (Fig S3c1). (ii) The features of genomic codon distribution for Bacteria includedeclining three-base-periodic fluctuations, medium amplitudes (Fig 3B, S3b2). In addition, the order of theaverage distribution heights is: Hierarchy 1, highest; followed by Hierarchy 2, Hierarchy 3 and Hierarchy 4(Fig S3c1). (iii) The features of genomic codon distribution for Archaea include declining three-base-periodicfluctuations, medium amplitudes (Fig 3B, S3b2). In addition, the order of the average distribution heights isroughly opposite to that of Bacteria: Hierarchy 4, highest; followed by Hierarchy 3 and Hierarchy 2; andHierarchy 1, lowest (Fig S3c1). Comparing Archaea with Bacteria, the dwarf 2nd peak at 6-base intervalin the genomic codon distribution is even lower than the 3rd peak at 9-base interval (Fig 3B, S3b2). Thisdetailed feature is different from Bacteria. For Bacteria, the 2nd peak is higher than the 3rd peak for codonsin Hierarchy 1 and the heights are almost the same for the 2nd and 3rd peaks for other codons (Fig 3B, S3b2).(iv) The features of genomic codon distribution for Eukarya are more complicated than for other domains,which include declining distributions, indistinct multi-base-periodic fluctuations, ups and downs of peaks,or overlapping peaks for some codons (Fig 3B, S3b2). In addition, the order of the average distributionheights is similar to that of Archaea: Hierarchy 4, highest; followed by Hierarchy 3 and Hierarchy 2;and Hierarchy 1, lowest (Fig S3c1). The features of genomic codon distribution for Archaea put togetherthe features of both Archaea and Eukarya (Fig 3B, S3b2). Crenarchaeota and Euryarchaeota are similar ingenome size distributions (Fig 3B, S3b1). It is showed that the tree of the three domains is more reasonablethan the eocyte tree of life, according to the features of genomic codon distributions (Fig 3B, S3b1).

The origin order for the domains can be obtained according to the linear regression plot whose abscissaand ordinate represent the recruitment order of codon pairs and the order of average distribution heightsrespectively (Fig 3C, S3c1). The greater the slope of the regression line for a domain is, the more the laterecruited codons are. So, the order of the slopes of the regression lines represents the origin order of the

20


https://doi.org/10.1101/017988

domains: 1 Virus, 2 Bacteria, 3 Archaea, 4 Eukarya (Fig 3C). The slope for Virus is negative and least (Fig3C), which means that majority of codons in the sequences recruited early. The virus originated earlier thanthe three domains. The slope for Bacteria is negative (Fig 3C). The proportion for Hierarchy 1 is high; thatfor Hierarchy 2 − 4, lower. The slope for Archaea is positive (Fig 3C), which means that majority of codonsin the sequences recruited late. The slope for Eukarya is positive and greatest (Fig 3C), which also means thatmajority of codons in the sequences recruited late. The Eukarya originated latest. Archaea originated betweenBacteria and Eukarya (Fig 3C). Limited data indicate that the slope for Crenarchaeota is less than the slopefor Euryarchaeota (Fig 3C), which suggests that Crenarchaeota may originate early than Euryarchaeota. Thisorigin order of the domains is independent from the base compositions for the domains, which are differentfrom the order of the domains by GC content (Fig S3a2).

The features of the genomic codon distributions for the domains can be simulated by a stochastic model.The features of declining three-base-periodic fluctuations can be simulated by the model (Fig 3B, S3b3).In the model, the bases G, C, A, T are given in certain base compositions, from which three bases arechosen statistically to make up a codon. Many a codon concatenates together to form a genome sequencein simulation. Hence, the genomic codon distribution is obtained by calculating the simulated genomesequence. When the length of the triplex DNA increased in the evolution along the roadmap (Fig S3a1),more and more codons are accommodated into the triplex DNA. The feature of declining three-base-periodicfluctuations, therefore, is achieved in the simulations. Let n0 denote the number of codons accommodated inthe triplex DNA. n0 codons are generated statistically to concatenate the simulated genome in several cycles.The genomic codon distribution thus obtained has the feature of declining three-base-periodic fluctuations.When n0 is small (e.g., around 16), the amplitude of the three-base-periodic fluctuations is great. Thegreater n0 is, the less the amplitude of the three-base-periodic fluctuations is. When n0 is 32 (namely allcodons participated in the concatenation), the three-base-periodic fluctuations vanish. The amplitude of thethree-base-periodic fluctuations decreases from Virus to Bacteria, then Archaea, finally Eukarya (Fig 3B,S3b3), which results from the increasing number of codons accommodated in the triplex DNA at differentstages in the evolution of the genetic code. The complicated features of genomic codon distributions foreukaryotes can also be simulated by the model. The features of ups and downs of peaks for eukaryotes canbe simulated by concatenating different set of codons that are generated from different subsets of the geneticcode (Fig S3b3). The alternative ”high peak - low peak” feature for some codons and the low amplitudesof the genomic codon distributions for other codons agree with the observations for true eukaryotes (Fig 3B,S3b2, S3b3).

The domains differ in the orders of the average distribution heights, which can also be simulated by thestatistical model (Fig S3c2). The GC content tends to decrease along the evolutionary direction accordingto the roadmap (Fig 1A, S2b2), which accounts for the above different orders among the domains. Differentbase compositions for G, C, A, T are assigned for Virus, Bacteria, Archaea and Eukarya in the simulations,in consideration of the decrement of GC content effected by the origin order of the domains. The simulationsof genomic codon distributions thus obtained for the domains (Fig S3b3) differ in the corresponding ordersof the average distribution heights (Fig S3c2). If high GC contents are assigned to Virus or Bacteria in the

21


https://doi.org/10.1101/017988

simulation, the average distribution heights from high to low are as follows: Hierarchy 1, Hierarchy 2,Hierarchy 3, and Hierarchy 4 (Fig S3c2). If low GC contents are assigned to Archaea or Eukarya in thesimulation, the average distribution heights from high to low are as follows: Hierarchy 4, Hierarchy 3,Hierarchy 2, and Hierarchy 1 (Fig S3c2). The simulations generally agree with the observations as to theorder of the average distribution heights (Fig S3c1, S3c2).

Incidentally, the validity of the composition vector tree (a method to infer the phylogeny from genomesequences) in the literatures [23] [24] can be explained based on the genomic codon distribution in this paper(Fig S3a3). The average distribution heights for the 64 codons are proportional to the numbers of codonsin the genomes. The summations of the elements of the unnormalized genomic codon distributions for aspecies represent the numbers of the 64 codons in the genome. So a set of average distribution heights isequivalent to the K = 3 composition vector for this species. For example, the summations of the elements ofthe unnormalized genomic codon distributions of GCA and GCA are 32 and 38 respectively, which means thatthere are 33 GCA and 39 T AT in the genome (Fig S3a3). Hence, the K = 3 composition vector for this speciesis obtained, where the elements corresponding to GCA and T AT are 33 and 39 respectively (Fig S3a3). TheK = 3 composition vector (Fig S3a3) in fact corresponds to the order of the average distribution heights (FigS3c1). It has been emphasised that the genomic codon distributions are essential features of species at thesequence level. The composition vector, as a deformed method of the genomic codon distribution, is certainlyvalid to infer the phylogeny.

2.3 The evolution space and the tree of life

According to the roadmap theory and the features of genomic codon distributions, an evolution space (Fig3D) has been constructed to explain the major and minor branches of the tree of life (Fig S3d8, S3d9). Theevolution space is 3-dimensional, the coordinates of which are fluctuation, route bias and hierarchy bias (Fig3D). (i) The fluctuation is defined by the total summation of the absolute value of the neighbouring differencesof the fluctuations in the 64 genomic codon distributions for a species. (ii) The route bias is defined by thedifference of the average genomic codon distributions between Route 0 and Route 1 − 3 (Fig S3d2). (iii) Thehierarchy bias is defined by the difference of the average genomic codon distributions between Hierarchy 1−2and Hierarchy 3 − 4 (Fig S3d2).

It should be pointed out that the domains cannot easily be distinguished into clusters by the bias similarlydefined on difference of average genomic codon distributions between arbitrarily divided two groups ofcodons. So, the definitions of route bias and hierarchy bias themselves reveal the essential features of thegenetic code. After a lot of groping around, it was realised that only proper groupings of codons by theroadmap are able to distinguish the domains (Fig 3D). When defining route bias and hierarchy bias underthe major premise of clusterings of the three domains, the recruitment order of codons and the roadmapsymmetry are also in consideration (Fig S3d1, S3d2). The genome size distributions can be divided into 20groups according to the 20 combinations of the 4 bases (Fig S3d1). The number 20 of combinations is equal

22


https://doi.org/10.1101/017988

to the number of canonical amino acids. And the 20 combinations generally correspond to the 20 tRNA inSubsection 1.4, among which 14 combinations correspond to certain tRNAs respectively (Fig S3d1, S2a3).So, the classifications of genome size distributions are of biological significance.

The three coordinates of the evolution space are obtained by calculations for each species, thus a speciescorresponds to a point in the evolution space (Fig 3D). The species gather together in different clusters forVirus, Bacteria, Archaea and Eukarya (Fig 3D, S3d3, S3d4, S3d5). The cluster distributions of taxa oflower rank are also observed in detail in the evolution space (Fig 3D, S3d3, S3d4, S3d5). The evolutionaryrelationships among species are thereby illustrated in the evolution space, which reveals the close relationbetween the origin of biodiversity and the origin of the genetic code. According to the fluctuations of genomiccodon distributions, the order of domains is as follows: first Virus, then Bacteria and Archaea, and lastEukarya (Fig S3d4, S3d5). Bacteria and Archaea are distinguished roughly by route bias (Fig S3d3, S3d5),and they can be distinguished better by both hierarchy bias and route bias (Fig 3D, S3d3). Diversification ofthe domains is, therefore, due to route bias and hierarchy bias at the sequence level. In addition, the leaf nodebias can be defined by the difference between the average genomic codon distributions of the leaf nodes inRoute 0−1 and that of the leaf nodes in Route 2−3 (Fig 1A, S3d2). Hence, Crenarchaeota and Euryarchaeotaare distinguished by leaf node bias (Fig S3d6, S3d7). Diversification of Crenarchaeota and Euryarchaeota is,therefore, due to leaf node bias at the sequence level. The cluster distributions of the domains in the evolutionspace shows that the biodiversity originated in the evolution of the genetic code at the sequence level. Andthe effects from the earlier stage of the evolution of the genetic code contribute more to the tree of life.

A tree of life (Fig S3d8) can be obtained based on the distances among species in the evolution space (Fig3D). The three coordinates of the evolution space need to be equally weighted by non-dimensionalization ofthe three coordinates through dividing by their standard deviations respectively first. Then a distance matrixis obtained by calculating the Euclidean distances among species in the non-dimensionalized evolution space.Thus, the tree of life is constructed by PHYLIP software based on the distance matrix (Fig S3d8). Note thatall the phylogenetic trees in this paper are constructed by PHYLIP software using UPGMA method [25].The tree of life based on the evolution space is reasonable to describe the evolutionary relationships amongspecies. Take the eukaryotes for example. Such a tree provides reasonable inclusion relationships amongvertebrates, mammals, primates, where chimpanzee is nearest to human beings. And yeast indicates its root(Fig S3d10).

A tree of taxa can also be obtained according to the average distances of species among Virus, Bacteria,Crenarchaeota, Euryarchaeota and Eukarya in the non-dimensionalized evolution space (Fig S3d9). The treeof taxa shows that the relationship between Crenarchaeota and Euryarchaeota is nearer than the relationshipbetween Euryarchaeota and Eukarya (Fig S3d9). The tree of taxa based on the evolution space agrees withthe three domain tree rather than the eocyte tree [26] [27] [28] [29] [30]. This result is accordant with theobservations that the features of genomic codon distributions between Crenarchaeota and Euryarchaeota arealso more similar than that between Euryarchaeota and Eukarya in Subsection 2.2.

23


https://doi.org/10.1101/017988

Life originated in the chiral chemistry in the roadmap scenario. The basic pattern of the genetic codedetermines the main branches of the tree of life, so biodiversity is an inborn feature of life. Thus the primordialecosystem originated from the evolution of the genetic code, and the living system is a natural expansion of theprimordial ecosystem. The living system of many species can adapt to the non-chiral surroundings throughthe birth and death of diverse individuals in the ecological cycle; however a system with only one speciesnever has had chance to survive in the non-chiral surroundings. Hence the last universal common ancestorbecomes unnecessary in the theories of evolution. And the phylogenetic relationships among species oughtto form a circular network rather than a tree.

24


https://doi.org/10.1101/017988

3 Reconstruction of the Phanerozoic biodiversity curve

3.1 Explanation of the trend of the Phanerozoic biodiversity curve based on genomesize evolution

Evolution has taken place and is continuing to occur at both the species level and the sequence level. Historyarchives of biodiversity on the earth have been stored not only in the fossil records at the species level [31] [32][33] but also in the genomes of contemporary species at the sequence level. A Phanerozoic biodiversity curvecan be plotted based on the fossil records [32] (Fig 4C, S4a8). The growth trend of this curve is exponential[32] (Fig 4C, S4a8), and the deviations from the exponential trend represent extinctions and originations [32](Fig 4B, S4a8). Environmental changes substantially influence the evolution of life at the species level, butdo not directly impact on the molecular evolution. Genome sizes are the most coarse-grained representationsof species at the sequence level. So, the evolution of life can be outlined by the evolution of genome size.And the genome size increased mainly via large scale duplications. Additionally considering the relativeinvariance of the genomic codon distributions in the duplications (Fig S3a4), the features of species at thesequence level do not change in general through genome size evolution. So, it is relatively independentbetween the evolution at the species level and the evolution at the sequence level. In this paper, a split andreconstruction scheme is proposed to explain the Phanerozoic biodiversity curve (Fig 4C, S4a8, S4c1): theexponential growth trend of the curve is postulated to be driven by the genome size evolution (Fig 4A), whilethe extinctions and originations in the fluctuations of the curve is postulated due to the Phanerozoic eustaticand climatic changes (Fig 4B).

It is necessary to shed light on the genome size evolution. It will be shown that the growth trend ofgenome size evolution is also exponential in general [34] [35]. The genome size distributions of species incertain taxa generally satisfy logarithmic normal distributions, according to the statistical analysis of genomesizes in the databases Animal Genome Size Database [36] and Plant DNA C-values Database [37] (Fig S4a1).The genome size distributions for the 7 higher taxa of animals and plants, namely, diploblastica, protostomia,deuterostomia, bryophyte, pteridophyte, gymnosperm and angiosperm, generally satisfy logarithmic normaldistributions, respectively. The genome size distribution for all the species in the two databases satisfieslogarithmic normal distribution very well, due to the additivity of normal distributions (Fig S4a1).

The logarithmic normal distributions of genome sizes for taxa can be simulated by a stochastic model (FigS4a2). Based on the duplication mechanism, the genome size evolution can be simulated by the stochasticmodel. The genome sizes obtained at a certain time in the simulation represent the genome sizes of speciesin a taxon at that time, which satisfy logarithmic normal distribution. The simulation indicates that, along thedirection of time, the logarithmic mean of genome sizes for a taxon GlogMean increases, and the logarithmicstandard deviation of genome sizes for a taxon GlogS D also increases (Fig S4a2). According to the simulation,the logarithmic mean of genome sizes for the ancestor of a taxon at the origin time GorilogMean can be estimatedby the genome sizes at present. Details are as follows. When genome size evolves in an exponential growth

25


https://doi.org/10.1101/017988

trend, GorilogMean at the origin time is always less than GlogMean at present for a certain taxon. The greaterGlogS D is (e.g. GB

logS D > GAlogS D in Fig S4a2), the earlier the origin time is (e.g. tB

origin < tAorigin in Fig S4a2), and

thereby the less GorilogMean is (e.g. GBorilogMean < GA

orilogMean in Fig S4a2). Thus, the genome size for ancestor atthe origin time GorilogMean is obtained by calculating the genome sizes at present: GlogMean minus GlogS D timesan undetermined factor χ (Fig S4a2, S4a7).

Based on the genome sizes of contemporary species in the databases, the logarithmic means of genomesizes GlogMean and the logarithmic standard deviations of genome sizes GlogS D are obtained for the 7 highertaxa of animals and plants, respectively (Fig S4a7). The genome sizes for the ancestors of the 7 higher taxaat their origin times GorilogMean are less than the corresponding GlogMean at present, whose differences can beestimated by the corresponding GlogS D times an undetermined factor χ, respectively (Fig S4a7). The trend ofthe genome size evolution cannot be worked out just only based on the genome sizes of contemporary species.Furthermore, a chronological reference has to be introduced here. The origin times for taxa can be estimatedaccording to the fossil records. In this paper, the origin times for the 7 higher taxa are assumed as followsrespectively: diploblastica, 560 Ma; protostomia, 542 Ma (Ediacaran-Cambrian); deuterostomia, 525 Ma;bryophyte, 488.3 Ma (Cambrian-Ordovician); pteridophyte, 416.0 Ma (Silurian-Devonian); gymnosperm,359.2 Ma (Devonian-Carboniferous); angiosperm, 145.5 Ma (Jurassic-Cretaceous) [38] [39] [40]. The mainresults in this paper cannot be affected substantially by the disagreements among the origin times in theliteratures or the expansion of the databases. A rough exponential growth trend of genome size evolutionis observed in the plot whose abscissa and ordinate represent the origin time and GlogMean respectively (FigS4a7), where the differences between the genome sizes at the origin time and that at present are not consideredyet. A more reasonable plot is introduced by taking GorilogMean rather than GlogMean as the ordinate (Fig S4a7).In the calculation, GorilogMean are obtained respectively for the 7 higher taxa by (Fig S4a7)

GorilogMean = GlogMean − χ ·GlogS D.

According to the regression analysis, a regression line approximates the relationship between the origin timeand GorilogMean (Fig S4a7). Let the intercept of the regression line be the logarithmic mean of genome sizesfor all the species in the databases, then the undetermined factor is χ = 1.57 (Fig S4a7). The slope of theregression line represents the exponential growth speed of genome size evolution, which was doubled foreach 177.8 Ma (Fig S4a7, 4A). It does not violate common sense that the genome size at the time of theorigin of life about 3800 Ma is about several hundreds base pairs, according to the regression line.

There are more evidences to support the method for calculating the ancestors’ genome size based on thedata of contemporary species. The values of GlogMean, GlogS D and GorilogMean are obtained for the 19 animaltaxa or for the 53 angiosperm taxa, respectively (Fig S4a3). The numbers of species in these taxa are greatenough for statistical analysis. The chronological order for the 19 animal taxa is obtained by comparingGorilogMean (Fig S4a5). The order of the three superphyla is suggested to be: 1 diploblastica, 2 protostomia, 3deuterostomia, which can be inferred from the chronological order of the 19 animal taxa and the superphylumaffiliation of the 19 taxa (Fig S4a5). This result supports the three-stage pattern in Metazoan origination basedon fossil records [41] [42] [43] [44] [45] [46] [47] [48]. The chronological order for the 53 angiosperm taxa

26


https://doi.org/10.1101/017988

is also obtained by comparing GorilogMean (Fig S4a6), which confirms the common sense that dicotyledoneaeoriginated earlier than monocotyledoneae. Let the abscissa and ordinate represent respectively GorilogMean

and GlogMean for the 7 higher taxa and 19 + 53 taxa, where the variation ranges of logarithm of genomesizes GlogMean ± GlogS D were also indicated on the ordinate. The upper triangular distribution of the genomesize variations thus obtained (Fig S4a4) reveals intrinsic features of the genome size evolution. The greaterGorilogMean is, the greater GlogMean is, and the less GlogS D is (Fig S4a4). The upper vertex of the triangle indicatesapproximately the upper limit of genome sizes of contemporary species. The upper triangular distribution(Fig S4a4) can be explained by the simulation of genome size evolution (Fig S4a2). The simulation showsthat GorilogMean and GlogMean increase along the evolutionary direction; the earlier the time of origination is,the greater GlogS D is (Fig S4a2, S4a4).

The exponential growth trend of Phanerozoic biodiversity curve can be explained by the exponentialgrowth trend of genome size evolution. Bambach et al. obtained the Phanerozoic biodiversity curve basedon fossil records, which satisfies the exponential growth trend (Fig 4C). The number of genera was doubledfor each 172.7 Ma in the Phanerozoic eon (Fig 4A). The genome size was doubled for each 177.8 Ma in thePhanerozoic eon (Fig 4A). The exponential growth speed for the number of genera agrees with the exponentialgrowth speed for the genome size. It is conjectured that the genome size evolution at the sequence level drivesthe evolution of life at the species level.

Note that Bambach et al.’s curve is chosen as the Phanerozoic biodiversity curve in this paper. It is basedon the following considerations. Based on fossil records, Sepkoski obtained the Phanerozoic biodiversitycurve at family level for the marine life first [49], then obtained the curve at genus level [50]. Sepkoski did notinclude single fossil records in order to reduce the impact on the curve at genus level from the samplings andtime intervals. In order to reduce such an impact furthermore, Bambach et al. chose only boundary-crossingtaxa in construction of the Phanerozoic biodiversity curve at genus level [32]. Rohde and Muller obtained acurve for all genera and another curve for well-resolved genera (removing genera whose time intervals areepoch or period, or single observed genera) respectively based on Sepkoskis Compendium, which can roughlybe taken as the upper limit and lower limit of the Phanerozoic curve (Fig 4C). These Phanerozoic biodiversitycurves are roughly in common in their variations, from which the five mass extinctions O-S, F-F, P-Tr, Tr-J,K-Pg can be discerned. There are several models to explain the growth trend of the Phanerozoic biodiversitycurve. Sepkoski and other co-workers tried to explain the growth trend by 2-dimensional Logistic model [51][52] [53]. Benton et al. explained it by exponential growth model [54]. The Bambach et al.’s curve is lowerin the Paleozoic era, and obviously higher in the Cenozoic era than Sepkoski’s curve. It is more obvious thatthe Bambach et al.’s curve is in exponential growth trend than Sepkoski’s curve.

The exponential growth trend and the net fluctuations of the Bambach et al.’s biodiversity curve can beexplained at both the molecular level and the species level respectively by a split and reconstruction scheme(Fig S4a8). In this scheme, the exponential growth trend of biodiversity is attributed to the driving force fromthe genome size evolution, in view of the approximate equality between the growth speed of number of generaand the growth speed of genome size (Fig 4A). The homochirality of life, the genetic code and the tree of life

27


https://doi.org/10.1101/017988

are generally invariant with the duplications of genome sizes. It maintains the features of life at the sequencelevel. Expansion of genome size provides conditions at the sequence level for the innovation of biodiversityat the species level. According to the split and reconstruction scheme, the extinctions and originations bythe deviations from the exponential growth trend of the biodiversity curve is interpreted as a biodiversitynet fluctuation curve. An unnormalised biodiversity net fluctuation is obtained by subtracting the growthtrend from the logarithmic Phanerozoic biodiversity curve (Fig S4a8), and the biodiversity net fluctuationcurve (Fig 4B, S4b5) is defined by dividing its standard deviation from the biodiversity net fluctuation curve.The biodiversity net fluctuation curve can be explained by the Phanerozoic eustatic curve and Phanerozoicclimatic curve at the species level in the next subsection (Fig 4B).

3.2 Explanation of the variation of the Phanerozoic biodiversity curve based onclimatic and eustatic data

According to the split and reconstruction scheme for the Phanerozoic biodiversity curve, the growth trend ofbiodiversity is attributed to the genome size evolution at the sequence level, and the net fluctuations deviatedfrom the growth trend is attributed to the couplings among earth’s spheres. The history of biodiversity in thePhanerozoic eon can be outlined by the biodiversity net fluctuation curve (Fig 4B): the curve reached the highpoint after the Cambrian radiation and the Ordovician radiation; then the curve reached the low point after atriple plunge of the O-S, F-F and P-Tr mass extinctions; and afterwards the curve climbed higher and higherexcept for mass extinctions and recoveries around Tr-J and K-Pg. The biodiversity net fluctuation curve canbe reconstructed through a climato-eustatic curve based on climatic and eustatic data (Fig 4B), which is freefrom the fossil records. Accordingly, it is advised that the tectonic movement can account for the five massextinctions.

Numerous explanations on the cause of mass extinctions have been proposed, such as climate change,marine transgression and regression, celestial collision, large igneous province, water anoxia, ocean acidification,release of methane hydrates and so on. According to a thorough study of control factors on mass extinctions,it is almost impossible to explain all the five mass extinctions by a single control factor. Although the O-Smass extinction may be attributed to the glacier age, the other mass extinctions almost happened in hightemperature periods. The other glacier ages in the Phanerozoic eon had always avoided the mass extinctions.Although the mass extinctions tend to happen with marine regressions, it is commonly believed that no marineregression happened in the end-Changhsingian, the greatest mass extinction [55] [56] [57] [58] [59]. Largeigneous province may play roles only in the last three mass extinctions [60] [61]. The K-Pg mass extinctionwas most likely to be caused by celestial collision, but there are still significant doubts on this explanation[62] [63] [64]. According to the high resolution study of the mass extinctions based on fossil records, moreunderstandings on the complexity of mass extinctions arise. For example, the P-Tr mass extinction can bedivided into two phases: End-Guadalupian and end-Changhsingian [65] [66] [67] [68] [69]; furthermore theend-Changhsingian stage can be subdivided into main stage (B line) and final stage (C line) [70] [71], andthe course of extinction may happened very rapidly [72] [73] [74] [75] [76]. Such a complex pattern of the

28


https://doi.org/10.1101/017988

mass extinction challenges any explanations on this problem. it is also almost impossible to attribute a massextinction to a single control factor. Based on the above discussions, it is wise to consider a multifactorialexplanation for the mass extinctions. In consideration of the interactions among the lithosphere, hydrosphere,atmosphere and biosphere, the tectonic movement of the aggregation and dispersal of Pangaea influenced thebiodiversity net fluctuation curve at several time scales through the impacts of hydrosphere and atmosphere(Fig 4B).

Before explaining the mass extinctions based on the tectonic movement, two preparations are required:first, to construct a Phanerozoic eustatic curve; second, to construct a Phanerozoic climatic curve.

The Phanerozoic eustatic curve is based on the following two results (Fig S4b3): (i) Haq sea level curve[77] [78], (ii) Hallam sea level curve [79]. In 1970s, Seismic stratigraphy developed a method called sequencestratigraphy to plot the sea level fluctuations. Vail in the Exxon Production Research Company obtained asea level curve based on this method, which concerns controversial due to business confidential data. Haqet al. replotted a sea level curve for the mesozoic era and cenozoic era also based on this method [77], anda sea level curve for the paleozoic era [78]. The Haq sea level curve here is obtained by combining thesetwo curves (Fig S4b3). Hallam obtained a sea level curve by investigating large amounts of data on sea levelvariations [79] (Fig S4b3). Generally speaking, it is not quite different between the Haq sea level curve andthe Hallam sea level curve. Therefore, a Phanerozoic eustatic curve can be obtained by averaging the Haqsea level curve and the Hallam sea level curve after nondimensionalization at equal weights (Fig S4b3).

The Phanerozoic climatic curve is based on the following four results (Fig S4b2): (i) the climatic gradientcurve based on climate indicators in the Phanerozoic eon [80] [81] (Fig S4b1); (ii) the atmospheric CO2

fluctuations in the Berner’s models that emphasise the weathering role of tracheophytes [82] [83] [84]; (iii)the marine carbonate 87S r/86S r record in the Phanerozoic eon in Raymo’s model [85] [86]. The more positive87S r/86S r values, assumed to indicate enhanced CO2 levels and globally colder climates; more negative87S r/86S r values, indicating enhanced sea-floor speeding and hydrothermal activity, are associated withincreased CO2 levels; (iv) the marine δ18O record in the Phanerozoic eon based on thermodynamic isotopeeffect [86] [87]. The above four independent results are quite different to one another (Fig S4b2). Consideringthe complexity of the problem on the Phanerozoic climate change yet, all the four methods are reasonablein principle on their grounds respectively. A consensus Phanerozoic climatic curve is obtained based onthe above four independent results after nondimensionalization at equal weights. Namely, the Phanerozoicclimatic curve is defined as the average of nondimensionalized climate indicator curve [80] [81], Berner’sCO2 curve [82], 87S r/86S r curve [85] [86] and δ18O curve [86] [87] (Fig S4b2).

Nondimensionalization is a practical and reasonable method to obtain the consensus Phanerozoic climaticcurve. As the temperature in a certain time is concerned, when the four results agree to one other, thecommon temperatures are adopted; when the four results disagree, the majority consensus temperatures areadopted (Fig S4b2). According to the Phanerozoic climatic curve, the temperature was high from Cambrianto Middle Ordovician (Fig S4b2); the temperature was low during O-S (Fig S4b2); the temperature increases

29


https://doi.org/10.1101/017988

from Silurian to Devonian, and it reach high during F-F (Fig S4b2); the temperature is extreme low duringD-C (Fig S4b2); the temperature rebounded slightly during Missisippian, and the temperature became loweragain during Pennsylvanian, and the temperature decreased to extreme low during early Permian (Fig S4b2);the temperature increase in Permian, and especially the temperature increase rapidly during the Lopingian,and reach an extreme high peak during P-Tr (Fig S4b2); the temperature is high from Lower Triassic toLower Jurassic, and it reached to extreme high during Lower Jurassic (Fig S4b2); the temperature becamelow from Middle Jurassic to Lower Cretaceous, where it is coldest in the Mesozoic era during J-K, whichwas not low enough to form glaciation (Fig S4b2); the temperature is high again from Upper Cretaceous toPaleocene, and the detailed high temperature during the Paleocene-Eocene event is observed (Fig S4b2); thetemperature decreased from Eocene to present, especially reached to the minimum in the Phanerozoic eonduring Quaternary (Fig S4b2). Four glacier ages are observed in the Phanerozoic eon (Fig S4b2): Hirnantianglacier age, Tournaisian glacier age, early Permian glacier age and Quaternary glacier age, especially bipolarice sheet at present. The Phanerozoic climatic curve also indicates the extreme low temperature during J-K(Fig S4b2), which corresponds to possible continental glaciers at high latitudes in the southern hemisphere[88]. In short, the fine-resolution Phanerozoic climatic curve agrees with the common sense on climatechange at the tectonic time scale.

The Phanerozoic climatic curve is taken as a standard curve to evaluate the four independent methods.The climate indicator curve is the best result among the four. It accurately reflects the four high temperatureperiods and four low temperature periods, and the varying amplitudes of the curve are relatively modestand reasonable. But it overestimated the temperature during Pennsylvanian (Fig S4b2). The δ18O curve isalso reasonable. It also reflects the four high temperature periods and four low temperature periods. But itoverestimated the temperature during Guadalupian, and underestimated the temperature during Jurassic (FigS4b2). The 87S r/86S r curve is reasonable to some extent. It reflects the high temperature during Mesozoicand low temperature during Carboniferous and Quaternary. But it underestimated the temperature duringCambrian, and overestimated the temperature during O-S and J-K (Fig S4b2). The Berner’s CO2 curve isrelatively rough. It reflects the high temperature during Mesozoic, But it overestimated the temperature fromCambrian to Devonian (Fig S4b2).

As the climate changes at the tectonic time scale is concerned, the Berner’s CO2 curve indicates relativelylong term climate changes; the climate indicator curve indicates relatively mid term climate changes; andthe 87S r/86S r and δ18O curves indicate relatively short term climate changes (Fig S4b2). The scale of thevariation of the Phanerozoic climatic curve is relatively accurate because this consensus climatic curve isobtained by combining all the different time scales of the above four climatic curves. Four high temperatureperiods Cambrian, Devonian, Triassic, K-Pg and four low temperature periods O-S, Carboniferous, J-K,Quaternary appeared alternatively in the Phanerozoic climatic curve, which generally reflect four periods inthe climate changes in the Phanerozoic eon (Fig S4b2). Such a time scale of the four periods of climatechanges is about equal to the average interval of mass extinctions, and is also equal to the time scale of thefour periods of the variation of the relative abundance of microbial carbonates in reef during Phanerozoic [89][90] [91] [92]. The agreement of the time scales indicates the relationship between the climate change and

30


https://doi.org/10.1101/017988

the fluctuations of biodiversity in the Phanerozoic eon.

The sea level fluctuations and the climate change have an important impact on the fluctuations of biodiversity.With respect to the impact of the sea level fluctuations, this paper follows the traditional view that sea levelfluctuations play positive contributions to the fluctuations of biodiversity. With respect to the impact ofthe climate change, this paper believes that climate change plays negative contributions to the fluctuationsof biodiversity. The climate gradient curve is opposite to the climate curve (Fig S4b2). The higher thetemperature is, the lower the climate gradient is, and consequently the less the global climate zones are; whilethe lower the temperature is, the higher the climate gradient is, and consequently the more the global climatezones are. When the climate gradient is high, there were tropical, arid, temperate and frigid zones among theglobal climate zones, which boosts the biodiversity. So, this paper believes that the biodiversity increases withthe climate gradient. This point of view agrees with the following observations. The biodiversity reboundedduring the low temperature period Carboniferous (Fig S4b5), and increased steadily during the ice age duringQuaternary (Fig S4b5); and the four mass extinctions F-F, P-Tr, Tr-J, K-Pg appeared in high temperatureperiods (Fig S4b5). Although the O-S mass extinction occurred in the ice age (Fig S4b5), this case is quitedifferent from other mass extinctions. In the first mass extinction during O-S, mainly the Cambrian fauna,which originated in the high temperature period, cannot survive from the O-S ice age. Mass extinctions nolonger occurred thereafter in the ice ages for those descendants who have successfully survived from the O-Sice age (Fig S4b5).

Based on the above analysis, the eustatic curve and the climatic gradient curve is positively related to thefluctuations of biodiversity in the Phanerozoic eon. Accordingly, a climato-euctatic curve is defined by theaverage of the nondimensionaliezed eustatic curve and the nondimensionaliezed climatic grandaunt curve,which indicates the impact of environment on biodiversity (Fig S4b4). It should be emphasised that theclimato-euctatic curve is by no means based on fossil records, which is only based on climatic and eustaticdata. It is obvious that the climato-euctatic curve agrees with the biodiversity net fluctuation curve (Fig4B). Judging from the overall features, the outline of the history of the Phanerozoic biodiversity can bereconstructed in the climato-eustatic curve, as far as the Cambrian radiation and the Ordovician radiation, thePaleozoic diversity plateau, the P-Tr mass extinction, and the Mesozoic and Cenozoic radiation are concerned(Fig 4B). Judging from the mass extinctions, the five mass extinctions O-S, F-F, P-Tr, Tr-J, K-Pg can bediscerned from the five steep descents in the climato-eustatic curve respectively (Fig 4B, S4b7). As to K-Pgespecially, there is obviously a pair of descent and ascent around K-Pg in the climato-eustatic curve (Fig4B), so celestial explanations are actually not necessary here. Judging from the minor extinctions, mostof the minor extinctions can be discerned in the climato-eustatic curve in details, as far as the following areconcerned: 3 extinctions during Cambrian, Cm-O extinction, S-D extinction, 2 extinctions during Carboniferous,1 extinction during Permian, 1 extinction during Triassic, 3 extinctions during Jurassic, 2 extinctions duringCretaceous, 2 extinctions during Paleogene, 1 extinction during Neogene and the extinction at present (Fig4B, S4b7). Judging from the increment of the biodiversity, the originations and radiations can be discernedfrom the ascents in the climato-eustatic curve, as far as the following are concerned: the Cambrian radiation,the Ordovician radiation, the Silurian radiation, the Lower Devonian diversification, the Carboniferous rebound,

31


https://doi.org/10.1101/017988

the Triassic rebound, the Lower Jurassic rebound, the Middle Cretaceous diversification, the Paleogenerebound (Fig 4B, S4b7). The derivative of the climato-eustatic curve indicates the variation rates (Fig 4B,S4b7). The extinctions are discerned by negative derivative, and the rebounds and radiations by positivederivative. It is obvious to discern the five mass extinctions, most of minor extinctions, and rebounds andradiations in the derivative curve of the climato-eustatic curve, just like in the climato-eustatic curve (Fig 4B,S4b7).

The interactions among the lithosphere, hydrosphere, atmosphere, biosphere are illustrated by the eustaticcurve, climatic curve, biodiversity net fluctuation curve and the aggregation and dispersal of Pangaea (FigS4b5, S4b6). There were about two long periods in the eustatic curve, which were divided by the event of themost aggregation of Pangaea (Fig S4b5). The first period comprises the grand marine transgression from theCambrian to the Ordovician and the grand marine regression from the Silurian to the Permian, and the secondperiod comprises the grand marine transgression during Mesozoic and the grand regression during Cenozoic(Fig S4b5). These observations reveal the tectonic cause of the grand marine transgressions and regressionsin the Phanerozoic eon, considering the aggregation and dispersal of Pangaea. The coupling between thehydrosphere and the atmosphere can be comprehended by comparing the eustatic curve and the climatic curve.The global temperature was high during both the grand marine transgressions (Fig S4b5). And all the four iceages in the Phanerozoic occurred during the grand marine regressions (Fig S4b5). Generally speaking, theclimate gradient curve varies roughly synchronously with the eustatic curve from the Cambrian to the LowerCretaceous along with the aggregation of Pangaea, except for the regression around the Tournaisian ice ages(Fig S4b5). And the climate gradient curve varies roughly oppositely with the eustatic curve thereafter fromthe Upper Cretaceous to the present along with the dispersal of Pangaea (Fig S4b5); namely the climategradient curve increases as the eustatic curve decreases. These observations reveal the tectonic cause of theclimate change at tectonic time scale in the Phanerozoic eon, considering the impact on the climate from thegrand transgressions and regressions. The biodiversity responses almost instantly to both the eustatic curveand the climatic curve, because the life spans of individuals of any species are much less than the tectonic timescale. Hence, the biodiversity net fluctuation curve agrees with the climato-eustatic curve very well (Fig 4B).In conclusion, the tectonic movement has substantially influenced the biodiversity through the interactionsamong the earth’s spheres (Fig S4b5, S4b6).

The interactions among the earth’s spheres can explain the outline of the biodiversity fluctuations in thePhanerozoic eon. During Cambrian and Ordovician, both the sea level and the climate gradient increased, sothe biodiversity increased (Fig S4b5). From Silurian to Permian, both the sea level and the climate gradientdecreased, so the biodiversity decreased (Fig S4b5), except for the biodiversity plateau during Carboniferousdue to the opposite increasing climate gradient and decreasing sea level (Fig S4b5). From Triassic to LowerCretaceous, both the sea level and the climate gradient increased, so the biodiversity increased (Fig S4b5).From Upper Cretaceous to present, the sea level decreased and the climate gradient increased rapidly, so thebiodiversity increased (Fig S4b5).

The interactions among the earth’s spheres can explain the five mass extinctions within the same theoretical

32


https://doi.org/10.1101/017988

framework, though their causes are different in detail. The mass extinctions O-S, K-Pg occurred around thepeaks in the eustatic curve respectively, and the P-Tr mass extinction around the lowest point in the eustaticcurve. There were about two periods in the eustatic curve and about four periods in the climatic curve in thePhanerozoic eon. So the period of climate change was about a half of the period of sea level fluctuations, as faras their changes at the tectonic time scale are concerned (Fig S4b5). By comparison, the F-F mass extinctionoccurred in the low level of climate gradient. There was a triple plunge from the high level of biodiversityduring the end-Ordovician to the extreme low level of biodiversity during Triassic in the biodiversity netfluctuation curve, which comprised a series of mass extinctions O-S, F-F, P-Tr (Fig 4B). It is interesting that arelatively less great extinction may occur after each of the triple plunge mass extinctions respectively: namelythe S-D extinction after the O-S mass extinction, the Carboniferous extinction after the F-F mass extinction,and the Tr-J mass extinction after the P-Tr mass extinction (Fig 4B). Especially during the end of Permian,the aggregation of the Pangaea pull down both the eustatic curve and the climate gradient curve, and hencepull down the biodiversity curve (Fig S4b5). After a long continual descent in the biodiversity net fluctuationcurve and after the two mass extinctions O-S, F-F, the ecosystem was more fragile at the end of Permian thanin the previous period when biodiversity was at the high level, therefore the P-Tr mass extinction occurred asthe most severe extinction event in the Phanerozoic eon (Fig 4B). The P-Tr mass extinction can be dividedinto the first phase at the end of Guadalupian and the second phase at the end of Changhsingian (Fig 4B).The extinction of Fusulinina mainly occurred at the end of Guadalupian; while the extinction of Endothyrinamainly occurred at the end of Changhsingian. Note that the number of Endothyrina did not decrease obviouslyat the first phase. Different extinction patterns in the above two phases indicate that the triggers for the twophases are different to each other. Marine regression occurred at the end of the Guadalupian, whereas itis indicated that no marine regression occurred at the end of the Changhsingian. The second phase of theend-Changhsingian extinction may be caused by the declining climate gradient (or increasing temperature).All the four independent methods in the above (climatic-sensitive sediments, Berner’s CO2, 87S r/86S r andδ18O) predict the rapidly increasing temperature at the end of the Changhsingian (Fig S4b2). Either Siberiatrap or the methane hydrates can raise the temperature at the end of the Changhsingian, which may trigger theP-Tr mass extinction. Four mass extinctions F-F, P-Tr, Tr-J, K-Pg and the Cm-O extinction occurred duringthe low level of climate gradient (Fig S4b5); as a special case, the O-S mass extinction occurred during thehigh level of climate gradient (Fig S4b5). Taken together, these observations show that low climate gradientcan contribute to the mass extinctions substantially. Generally speaking, the mass extinctions might occurcoincidentally when brought together two or more factors among the low climate gradient, marine regression,and the like (Fig S4b5).

The interactions among the earth’s spheres can explain the minor extinctions, radiations and rebounds.The minor descents in the biodiversity net fluctuation curve relate to mid-term or short-term changes insea level and climate at the tectonic time scale, which correspond approximately to the minor extinctionsrespectively [93] [94] (Fig 4B, S4b7). It is shown that most minor extinctions in the Paleozoic era correspondedto low climate gradient, and most minor extinctions during the Mesozoic and Cenozoic corresponded tomarine regressions (Fig 4B, S4b7). After mass extinctions, both the climate gradient and sea level tended to

33


https://doi.org/10.1101/017988

increase; henceforth the radiations and rebounds occurred. Especially for the case of the P-Tr mass extinction,the climate gradient remained low throughout the Lower Triassic (Fig 4B, S4b5), so it took a long time forthe recovery from the P-Tr mass extinction.

3.3 Reconstruction of the Phanerozoic biodiversity curve and the strategy of theevolution of life

Now, let’s reconstruct the Phanerozoic biodiversity curve. In Subsection 3.1, the exponential growth trend ofthe Phanerozoic biodiversity curve has been explained by the exponential growth trend of the genome sizeevolution (Fig 4A); in Subsection 3.2, the deviation from the exponential growth trend of the Phanerozoicbiodiversity curve has been explained by the climato-eustatic curve (Fig 4B). By combining the exponentialgrowth trend of the genome size evolution and the climato-eustatic curve, the biodiversity curve based onfossil records can be reconstructed based on the genomic, climatic and eustatic data (Fig 4C). The techniquedetails are as follows (Fig S4c1). The reconstructed biodiversity net fluctuation curve is obtained based on theclimato-eustatic curve times a certain weight, where the weight is the standard deviation of the biodiversitynet fluctuation curve (Fig S4c1). And the slope and the intercept of the line that corresponds to the exponentialgrowth trend of the reconstructed biodiversity is obtained as follows. Its slope is defined by the slope of theline that corresponds to the exponential growth trend of genome size evolution (Fig S4c1), and its interceptis defined by the intercept of the line that corresponds to the exponential growth trend of the biodiversitycurve (Fig S4c1). The reconstructed biodiversity curve thus obtained agrees nicely with the Bambach et al.’sbiodiversity curve within the error range of fossil records (Fig 4C).

The above split and reconstruction scheme of the Phanerozoic biodiversity curve can explain not onlythe exponential growth trend and the detailed fluctuations, but also the roughly synchronously decliningorigination rate and extinction rate throughout the Phanerozoic eon [95] [96] [97] [98] [99] (Fig 4D, S4d1,S4d2, S4d3). Ramp and Sepkoski first found the declining extinction rate throughout the Phanerozoic eon[100] [101]. And Sepkoski also found the declining origination rate throughout the Phanerozoic eon [94].The numbers of origination genera and extinction genera are marked in the Bambach et al.’s biodiversitycurve. The origination rate and extinction rate are obtained from the numbers of origination or extinctiongenera divided by the corresponding time intervals (Fig S4d3). The difference between the origination rateand extinction rate is the variation rate of number of genera (Fig S4d3). It is shown that both the originationrate and extinction rate decline roughly synchronously throughout the Phanerozoic eon [102] (Fig S4d3).

The rough synchronicity between the origination rate and the extinction rate can be explained according tothe split and reconstruction scheme. The Phanerozoic biodiversity curve can be divided into two independentparts: an exponential growth trend and a biodiversity net fluctuation curve (Fig S4a8, S4c1). Notice thatthere is no declining trend in the climato-eustatic curve, the expectation value of which is constant (Fig 4B,S4d1); whereas the biodiversity curve was continuously boosted upwards by the exponential growth trendof genome size evolution (Fig 4A). The expectation value of the difference between the origination rate

34


https://doi.org/10.1101/017988

and extinction rate corresponds to the exponential growth trend of biodiversity boosted by the genome sizeevolution (Fig S4d1, S4d2). If too large or too small origination rate brings the biodiversity to deviate fromthe exponential growth trend, the extinction rate has to response by increasing or decreasing so as to maintainthe constant exponential growth trend of the biodiversity (Fig S4d2, S4d3). The rough synchronicity betweenthe origination rate and the extinction rate shows that the growth trend of biodiversity at the species level wasconstrained strictly by the exponential growth trend of genome size evolution at the sequence level.

The declining trends of the origination rate and the extinction rate can be explained according to the splitand reconstruction scheme. It is necessary to assume that there is an upper limit of ability to create newspecies in the living system, which is due to the constraint of the earth’s environment at the species leveland the limited mutation rate at the sequence level (Fig S4d1). Although there is no declining trend forthe variation rate of the biodiversity net fluctuation curve (Fig S4b7, S4d1), the trend of the variation rateof the Phanerozoic biodiversity curve tends to decline in general, because the exponential growth trend ingenome size evolution is added in the denominator in the calculations (Fig 4D, S4d2, S4d3). The decliningvariation rate throughout the Phanerozoic eon shows that the genome size evolution played a primary role inthe growth of biodiversity. The exponential growth trend of biodiversity driven by the genome size evolutioncan gradually weaken the impact on the biodiversity fluctuations from the changes of sea level and climate(Fig 4D, S4d2). The rough synchronicity between the origination rate and the extinction rate can explainboth the declining origination rate and the declining extinction rate throughout the Phanerozoic eon (FigS4d2, S4d3).

The biodiversity evolves independently at the sequence level and at the species level. This ensuresthe adaptation of life to the earth’s changing environment. The essential feature of the living system isits homochirality. At the beginning of life, the homochirality, the genetic code, and the genomic codondistributions had established. The homochirality ensures the safety for the living system: the earth’s changingenvironment influenced the living system only at the species level, which cannot directly threaten the existenceof life at the sequence level. The strategy of adaptation is that: life generates numerous redundant speciesat the sequence level; thus life sacrificed opportunity of survival for the vanished majority of the species inexchange for the present outstanding adaptability to the changing environment. By virtue of this strategy ofadaptation, the chiral living system may adapt to a wide range of environmental changes on planets.

35


https://doi.org/10.1101/017988

References

[1] V. N. Soyfer, V. N. Potaman, Triple-Helical Nucleic Acids (Springer-Verlag, New York, 1996).

[2] B. P. Belotserkovskii et al., Formation of intramolecular triplex in homopurine-homopyrimidine mirror repeatswith point substitutions. Nucleic Acids Research 18, 6621-6624 (1990).

[3] J. T. Wong, Coevolution theory of the genetic code at age thirty. BioEssays 27, 416-425 (2005).

[4] E. N. Trifonov et al., Primordia vita. deconvolution from modern sequences. Orig Life Evol Biosph 36, 559-565(2006).

[5] M. Di Giulio, On the origin of the transfer RNA molecule. J. Theor. Biol. 159, 199-214 (1992).

[6] M. Di Giulio, Was it an ancient gene codifying for a hairpin RNA that, by means of direct duplication, gave riseto the primitive tRNA molecule? J. Theor. Biol. 177, 95-101 (1995).

[7] M. Di Giulio, The nonmonophyletic origin of tRNA molecule. J. Theor. Biol. 197, 403-414 (1999).

[8] M. Di Giulio, The origin of the tRNA molecule: Implications for the origin of protein synthesis. J. Theor. Biol.226, 89-93 (2004).

[9] M. Di Giulio, Nanoarchaeum equitans is a living fossil. J. Theor. Biol. 242, 257-260 (2006).

[10] R. F. Gesteland et al. eds., The RNA World (Cold Spring Harbor Laboratory, New York, 2006), Third Edition.

[11] G. Eriani et al., Partition of tRNA synthetases into two classes based on mutually exclusive sets of sequencemotifs. Nature 347, 203-206 (1990).

[12] Y. Nambu and G. Jona-Lasinio, Dynamical Model of Elementary Particles Based on an Analogy withSuperconductivity. I. Phys. Rev. 122, 345-358 (1961).

[13] Y. Nambu and G. Jona-Lasinio, Dynamical Model of Elementary Particles Based on an Analogy withSuperconductivity. II. Phys. Rev. 124, 246-254 (1961).

[14] A. Muto, S. Osawa, The guanine and cytosine content of genomic DNA and bacterial evolution. Proc. Natl.Acad. Sci. U.S.A. 84, 166-169 (1987).

[15] A. Gorban, Codon usage trajectories and 7-cluster structure of 143 complete bacterial genomic sequences.Physica A 353, 365-387 (2005).

[16] E. N. Trifonov et al., Distinc stage of protein evolution as suggested by protein sequence analysis. J. Mol. Evol.53, 394-401 (2001).

[17] E. N. Trifonov, The triplet code from first principles. Journal of Biomolecular Structure & Dynamics 22:1 (2004).

[18] I. S. Povolotskaya et al., Stop codons in bacteria are not selectively equivalent. Biology Direct 7:30 (2012).

[19] J. T. Wong, A coevolution theory of the genetic code. Proc. Natl. Acad. Sci. U.S.A. 72, 1909-1912 (1975).

[20] J. T.-F. Wong, A. Lazcano, Prebiotic Evolution and Astrobiology (Landes Bioscience, Austin Texas, 2009).

[21] D. J. Li, S. Zhang, Genetic code evolution as an initial driving force for molecular evolution. Physica A 388,3809-3825 (2009).

[22] E. N. Trifonov, Consensus temporal order of amino acids and evolution of the triplet code. Gene 261, 139-151(2000).

36


https://doi.org/10.1101/017988

[23] F. Delsuc et al., Phylogenomics and the reconstruction of the tree of life. Nat. Rev. Genet. 6, 361-375 (2005).

[24] J. Qi et al., Whole genome prokaryote phylogeny without sequence alignment: A K-string composition approach.J. Mol. Evol. 58, 1-11 (2004).

[25] J. Felsenstein, Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. 17,368-376 (1981).

[26] C. R. Woese et al., Towards a natural system of organisms: Proposal for the domains Archaea, Bacteria, andEucarya. Proc. Natl. Acad. Sci. U.S.A. 87, 4576-4579 (1990).

[27] J. A. Lake et al., Eocytes: A new ribosome structure indicates a kingdom with a close relationship to eukaryotes.Proc. Natl. Acad. Sci. U.S.A. 81, 3786-3790 (1984).

[28] J. A. Lake, Origin of the eukaryotic nucleus determined by rate-invariant analysis of rRNA sequences. Nature331, 184-186 (1988).

[29] J. M. Archibald, The eocyte hypothesis and the origin of eukaryotic cells. Proc. Natl. Acad. Sci. U.S.A. 105,20049-20050 (2008).

[30] C. J. Cox et al., The archaebacterial origin of eukaryotes. Proc. Natl. Acad. Sci. U.S.A. 105, 20356-20361 (2008).

[31] J. J. Sepkoski, Jr., A compendium of fossil marine animal genera. Bulletins of American Paleontology No. 363(2002).

[32] R. K. Bambach et al., Origination, extinction, and mass depletions of marine diversity. Paleobiology 30, 522-542(2004).

[33] R. A. Rohde, R. A. Muller, Cycles in fossil diversity. Nature 434, 208-210 (2005).

[34] A. A. Sharov, Genome increases as a clock for the origin and evolution of life. Biology Direct 1, 17 (2007).

[35] D. J. Li, S. Zhang, The Cambrian explosion triggered by critical turning point in genome size evolution.Biochemical and Biophysical Research Communications 392, 240-245 (2010).

[36] T. R. Gregory, Animal Genome Size Database, http://www.genomesize.com.

[37] M. D. Bennett, I. J. Leitch, Plant DNA C-values database (release 5.0, Dec. 2010), http://www.kew.org/cvalues/.

[38] K. J. Willis, J. C. McElwain, The Evolution of Plants (Oxford Univ. Press, Oxford, 2014).

[39] M. R. Gibling, N. S. Davies, Palaeozoic landscapes shaped by plant evolution. Nature Geoscience 5, 99-105(2012).

[40] T. L. P. Couvreur, Early evolutionary history of the flowering plant family Annonaceae: steady diversificationand boreotropical geodispersal. Journal of Biogeography 38, 664-80 (2011).

[41] D. Shu, Cambrian explosion: Birth of tree of animals. Gondwana Research 14, 219-240 (2008).

[42] S. Conway-Morris, The fossil record and the early evolution of the metazoa. Nature 361, 219-225 (1993).

[43] S. Conway-Morris, The Burgess shale fauna and the Cambrian explosion. Science, 339-346 (1989).

[44] G. Budd, S. Jensen, A critical reappraisal of the fossil record of the bilaterian phyla. Biological Review 75,253-295 (2000).

[45] D. Shu, On the phylum vetulicolia. Chinese Science Bulletin 50, 2342-2354 (2005).

37


https://doi.org/10.1101/017988

[46] D. Shu et al., Ancestral echinoderms from the Chengjiang deposits of China. Nature 430, 422-428 (2004).

[47] J. W. Valentine, How were vendobiont bodies patterned? Palaeobiology 27, 425-428 (2001).

[48] D. Shu et al., Restudy of cambrian explosion and formation of animal tree. ACTA Palaeontologica Sinica 48,414-427 (2009).

[49] J. J. Sepkoski, Jr., A factor analytic description of the Phanerozoic marine fossil record. Paleobiology 7, 36-53(1981).

[50] J. J. Sepkoski, Jr., Biodiversity: Past, present, and future. Journal of Paleontology 71, 533-539 (1997).

[51] J. J. Sepkoski, Jr., A kinetic model of Phanerozoic taxonomic diversity I, Analysis of marine orders. Paleobiology4, 223-251 (1978).

[52] J. J. Sepkoski, Jr., A kinetic model of Phanerozoic taxonomic diversity II, Early Phanerozoic families andmultiple equilibria. Paleobiology 5, 222-251 (1979).

[53] J. J. Sepkoski, Jr., A kinetic model of Phanerozoic taxonomic diversity III, Post-Paleozoic families and multipleequilibria. Paleobiology 10, 246-267 (1984).

[54] D. Hewzulla et al., Evolutionary patterns from mass originations and mass extinctions. Philos. Trans. R. Soc.London Ser. B 354, 463-469 (1999).

[55] P. B. Wignall, A. Hallam, Anoxia as a cause of the Permian/Triassic extinction: facies evidence from northernItaly and the western United States. Palaeogeography, Palaeoclimatology, Palaeoecology 93, 21-46 (1992).

[56] P. B. Wignall, A. Hallam, Griesbachian (Earliest Triassic) Palaeoenvironmental changes in the Salt Range,Pakistan and southeast China and their bearing on the Permo-Triassic mass extinction. Palaeogeography,Palaeoclimatology, Palaeoecology 102, 215-237 (1993).

[57] P. B. Wignall, R. J. Twitchett, Oceanic anoxia and the end Permian mass extinction. Science 272, 1155-1158(1996).

[58] H. Yin, J. Tong, Multidisciplinary high-resolution of the Permian-Triassic boundary. Palaeogeography,Palaeoclimatology, Palaeoecology 143, 199-211 (1998).

[59] D. H. Erwin et al., End-Permian mass extinction: A review. In C. Koeberl and K. G. MacLeod eds., CatastrophicEvents and Mass Extinctions: Impacts and Beyond. Geological Society of America Special Paper 356, 363-383(2002)

[60] P. R. Renne, A. R. Basu, Rapid eruption of the Siberia Traps flood basalts at the Permo-Triassic boundary.Science 253, 176-179 (1991).

[61] I. H. Campbell et al., Synchronism of the Siberia Traps and the Permian-Triassic boundary. Science 258,1760-1763 (1992).

[62] L. W. Alvarez et al., Extraterrestrial cause for the Cretaceous? Tertiary extinction. Science 208, 1095-1108(1980).

[63] G. Keller et al., Chicxulub impact predates the K-T boundary mass extinction. Proc. Natl. Acad. Sci. U.S.A. 101,3753-3758 (2004).

[64] S. A. Stewart, P. J. Allen, A 20-km-diameter multi-ringed impact structure in the North Sea. Nature 418, 520-523(2002).

[65] Y. G. Jin, Two phases of the end-Permian extinction. Palaeoworld 1, 39 (1991).

38


https://doi.org/10.1101/017988

[66] Y. G. Jin, The pre-Lopingian benthose crisis. Compte Rendu, the 12th ICC-P 2, 269-278 (1993).

[67] S. M. Stanley, X. Yang, A double mass extinction at the end of the Paleozoic era. Science 266, 1340-1344 (1994).

[68] S. Shen et al., Calibrating the End-Permian mass extinction. Science 334, 1367-1372 (2011).

[69] Y. G. Jin et al., Pattern of marine mass extinction near the Permian-Triassic boundary in south China. Science289, 432-436 (2000).

[70] S. Xie et al., Two episodes of microbial change coupled with Permo/Triassic faunal mass extinction. Nature 434,494-497 (2005).

[71] Z. Chen, M. J. Benton, The timing and pattern of biotic recovery following the end-Permian mass extinction.Nature Geoscience 5, 375-383 (2012).

[72] S. A. Bowring et al., The tempo of mass extinction and recovery: The end Permian example. Proc. Natl. Acad.Sci. U.S.A. 96, 8827-8828 (1999).

[73] Y. Eshet et al., Fungal event and palynological record of the ecological crisis and recovery across thePermian-Triassic boundary. Geology 23, 967-970 (1995).

[74] M. R. Rampino et al., Tempo of the end-Permian event, High-resolution cyclostratigraphy at thePermian-Triassic boundary. Geology 28, 643-646 (2000).

[75] P. D. Ward et al., Altered river morphology in South Africa related to the Permian-Triassic extinction. Science289, 1740-1743 (2000).

[76] R. J. Twichett, Incompleteness of the Permian-Triassic fossil record, a consequence of productivity decline?Geological Journal 36, 341-353 (2001).

[77] B. U. Haq et al., Chronology of fluctuating sea levels since the triassic. Science 235, 1156-1167 (1987).

[78] B. U. Haq, S. R. Schutter, A Chronology of Paleozoic sea-level changes. Science 322, 64-68 (2008).

[79] A. Hallam, Phanerozoic Sea Level Changes (Columbia Univ. Press, New York, 1992).

[80] A. J. Boucot, J. Gray, A critique of Phanerozoic climatic models involving changes in the CO2 content of theatmosphere. Earth-Science Reviews 56, 1-159 (2001).

[81] A. J. Boucot et al., Reconstruction of the Phanerozoic Global Paleoclimate (Science Press, Beijing, 2009).

[82] R. A. Berner, The carbon cycle and CO2 over Phanerozoic time: the role of land plants. Phil. Trans. R. Soc.Lond. B. 353, 75-82 (1998).

[83] R. A. Berner, Z. Kothavala, GEOCARB III: A revised model of atmospheric CO2 over phanerozoic time.American Journal of Science 301, 182-204 (2001).

[84] R. A. Berner et al., Phanerozoic atmospheric oxygen. Annu. Rev. Earth Planet Sci. 31, 105-134 (2003).

[85] M. E. Raymo, Geochemical evidence supporting T. C. Chamberlin’s theory of glaciation. Geology 19, 344-347(1991).

[86] J. Veizer et al., 87S r/86S r, δ13C and δ18O evolution of Phanerozoic seawater. Chemical Geology 161, 59-88(1999).

[87] J. Veizer et al., Evidence for decoupling of atmospheric CO2 and global climate during the Phanerozoic eon.Nature 408, 698-701 (2000).

39


https://doi.org/10.1101/017988

[88] H. M. Stoller, D. P. Schrag, Evidence for glacial control of rapid sea level changes in the Early Cretaceous.Science 272, 1771-1774 (1996).

[89] R. Riding, L. Liang, Geobiology of microbial carbonates: metazoan and seawater saturation state influenceson secular trends during the Phanerozoic. Palaeogeography, Palaeoclimatology, Palaeoecology 219, 101-115(2005).

[90] R. Riding, Phanerozoic reefal microbial carbonate abundance: comparisons with metazoan diversity, massextinction events, and seawater saturation state. Revista Espanola de Micropaleontologia 37, 23-38 (2005).

[91] R. Riding, Microbial carbonate abundance compared with fluctuations in metazoan diversity over geologicaltime. Sedimentary Geology 185, 229-238 (2006).

[92] W. Kiessling et al., Paleoreef maps: evaluation of a comprehensive database on Phanerozoic reefs. AAPGBulletin 83, 1552-1587 (1999).

[93] M. J. Benton, D. A. T. Harper, Basic Palaeontology (Addison Wesley Longman, Essex, England, 1997).

[94] J. J. Sepkoski, Jr., Rates of speciation in the fossil record. Philosophical Transactions of the Royal Society ofLondon, Series B 333, 315-326 (1998).

[95] K. W. Flessa, D. Jablonski, Declining Phanerozoic background extinction rates: effect of taxonomic structure.Nature 313, 216-218 (1985).

[96] L. Van Valen, How constant is extinction? Evol. Theory 7, 93-106 (1985).

[97] J. J. Sepkoski, Jr., A model of onshore-offshore change in faunal diversity. Paleobiology 17, 58-77 (1991).

[98] N. L. Gilinsky, Volatility and the Phanerozoic decline of background extinction intensity. Paleobiology 20,445-458 (1994).

[99] J. Alroy, Equilibrial diversity dynamics in north American mammals, in M. L. McKinney, J. A. Drake ed.,Biodiversity Dynamics: Turnover of Populations, Taxa, and Communities (Columbia Univ Press, New York,1998).

[100] D. M. Raup, J. J. Sepkoski, Jr., Mass extinctions in the marine fossil record. Science 215, 1501-1503 (1982).

[101] M. E. J. Newman, G. J. Eble, Decline in extinction rates and scale invariance in the fossil record. Paleobiology25, 434-439 (1999).

[102] D. Jablonski, Species selection: Theory and data. Annu. Rev. Ecol. Evol. Syst. 39, 501-524 (2008).

E-mail: [email protected]

Acknowledgements My warm thanks to Jinyi Li for valuable discussions. Supported by the FundamentalResearch Funds for the Central Universities.

40


https://doi.org/10.1101/017988

Index

Section 1roadmap for the evolution of the genetic code

(the roadmap for short)initiation stage of the roadmap

expansion stage of the roadmap

ending stage of the roadmap

origin of the homochirality of lifeorigin of the genetic codeorigin of the three domains of lifebase pair of double helix DNAbase triplex of triplex DNArelative stability of base triplexbase substitutions in the roadmap

codon pair in the roadmap

pair connection in the roadmap

route duality in the roadmap

Route 0 − 3 of the roadmap

Hierarchy 1 − 4 of the roadmap

initial subset in the roadmap

recruitment order of codon pairs (#1 to #32)recruitment order of amino acids (No. 1 to No. 20)cooperative recruitment of codons and amino acidsnon-standard codons in the roadmap

origin of stop codonsorigin of tRNAsbranch node in the roadmap

leaf node in the roadmap

palindromic codon pairnon-palindromic codon paircomplementary single RNA strandsassignment scheme of anticodonspermutations of basescombinations of basesexplanation of number of canonical amino acidsexplanation of the wobble paring ruleexplanation of codon degeneracyminimum subset in the roadmap

facet of cube for routebiosynthetic families of amino acidsstandard genetic code tableGCAU genetic code tableclassification of aminoacyl-tRNA synthetaseschiral roadmapwinner-take-all competition between chiral roadmapssimulation of the origin of homochiralityexplanation of position specific GC contentexplanation of stop codon usageexplanation of variation of amino acid frequenciesorder R10/10 for amino acidsPhase I and II of amino acids

Section 2genomic codon distributioncodon intervals in genomecorrelation coefficient matrixdistance matrix by genomic codon distributions

tree of codons by genomic codon distributions

tree of species by genomic codon distributions

tree of eukaryotes by genomic codon distributions

rough invariance of genomic codon distributionrough periodicity of genomic codon distributiondeclining three-base-periodic fluctuationsamplitudes of genomic codon distributionorder of the average distribution heightsorigin order of domains by slope of regression lineevolutionary status of Virussimulation of genomic codon distributionsexplanation of validity of composition vector treeevolution spacefluctuation (coordination of the evolution space)route bias (coordination of the evolution space)hierarchy bias (coordination of the evolution space)leaf node bias

41


https://doi.org/10.1101/017988

classification of genomic codon distributionsnon-dimensionalized evolution spacedistance matrix by the evolution space

tree of species by the evolution space

tree of taxa by the evolution space

tree of eukaryotes by the evolution space

three-domain tree of lifeeocyte tree of life

Section 3split and reconstruction schemeevolution of life at sequence levelevolution of life at species levelgenome size evolutionPhanerozoic biodiversity curvePhanerozoic climatic curvePhanerozoic eustatic curveexponential growth trend of the biodiversity curveexponential growth trend in genome size evolutionbiodiversity net fluctuation curveclimate gradient curveclimto-eustatic curvelogarithmic normal distribution of genome sizeslogarithmic mean of genome sizeslogarithmic standard deviation of genome sizesoriginal logarithmic mean of genome sizesfactor χsimulation of genome size evolution

interactions among earth’s spheresfive mass extinctionscontrol factors of mass extinctionsmultifactorial explanation of mass extinctionsHaq sea level curveHallam sea level curveclimate indicator curveBerner’s CO2 curvemarine 87S r/86S r curvemarine δ18O curveaverage after nondimensionalizationderivative curvesaggregation and dispersal of Pangaeaeustatic change at tectonic time scaleclimatic change at tectonic time scaleinstant response of biodiversity at tectonic time scaletectonic cause of mass extinctionsoutline of the biodiversity fluctuationstwo phases of P-Tr mass extinctionminor extinctionsradiations and rebounds of biodiversityreconstructed Phanerozoic biodiversity curveexplanation of the declining trend of origination rateexplanation of the declining trend of extinction ratesynchronicity between origination and extinctionconstraint on origination abilityredundant species created at the sequence levelstrategy of adaptation

42


https://doi.org/10.1101/017988

#11G → A

Y1 R1 * R11 Y11CCC

(stop,Ser)10Arg 6Pro

GGG


GGA


CCT


+

+

++

#14G → A

Y11R11*y11→ y11r11*R14 Y14TCC

(stop,Ser)10Arg 7Ser

AGG


TCC


CCT


GGA


AGA


TCT


4+

4+

4+

++

+

+

#1 initial subset

Y R * R1 Y1CCC

1Gly 6Pro

GGG

1Gly 6Pro

GGG

1Gly 6Pro

CCC

1Gly 6Pro

+

+

+

#3 G → A initial subset

Y1 R1 * R3 Y3CCC

1Gly 7Ser

GGG

1Gly 7Ser

AGG

1Gly 7Ser

TCC

1Gly 7Ser

++

+

+

#27G → A

Y4 R4 * R27 Y27CTC

20Lys (Thr)8Leu

GAG

20Lys (Thr)8Leu

GAA

20Lys (Thr)8Leu

CTT

20Lys (Thr)8Leu

+

+

++

#32G → A

Y23 R23 * R32 Y32CTT

(Asn)20Lys 17Phe

GAA

(Asn)20Lys 17Phe

AAA

(Asn)20Lys 17Phe

TTT

(Asn)20Lys 17Phe

++

+

+


Y1 R1 * R4 Y4CCC

3Glu (Thr)8Leu

GGG

3Glu (Thr)8Leu

GAG

3Glu (Thr)8Leu

CTC

3Glu (Thr)8Leu

+

++

+

#23G → A

Y4 R4 * R23 Y23CTC

3Glu 17Phe

GAG

3Glu 17Phe

AAG

3Glu 17Phe

TTC

3Glu 17Phe

++

+

+


Y2 R2 *y2 → y2 r2 *R5 Y5 CCG

4Asp 5Val

GGC

4Asp 5Val

CCG

4Asp 5Val

GCC

4Asp 5Val

CGG

4Asp 5Val

CAG

4Asp 5Val

GTC

4Asp 5Val

4+

4+

+

+

++

+

#26G → A

Y5 R5 *y5 → y5 r5 *R26 Y26CTG

19Asn 5Val

GAC

19Asn 5Val

CTG

19Asn 5Val

GTC

19Asn 5Val

CAG

19Asn 5Val

CAA

19Asn 5Val

GTT

19Asn 5Val

4+

4+

+

+

+

++

#2 G → C initial subset

Y1 R1 * R2 Y2CCC

1Gly 2Ala

GGG

1Gly 2Ala

CGG

1Gly 2Ala

GCC

1Gly 2Ala

4+

+

+

#8 G → A

Y2 R2 *y2 → y2 r2 *R8 Y8 CCG

7Ser 2Ala

GGC

7Ser 2Ala

CCG

7Ser 2Ala

GCC

7Ser 2Ala

CGG

7Ser 2Ala

CGA

7Ser 2Ala

GCT

7Ser 2Ala

4+

4+

+

+

+

++

#21G → A

Y6 R6 *y6 → y6 r6 *R21 Y21CCA

4Asp 15Ile

GGT

4Asp 15Ile

CCA

4Asp 15Ile

ACC

4Asp 15Ile

TGG

4Asp 15Ile

TAG

4Asp 15Ile

ATC

4Asp 15Ile

4+

4+

−

+

++

+

#30G → A

Y21R21*y21→ y21r21*R30 Y30CTA

19Asn 15Ile

GAT

19Asn 15Ile

CTA

19Asn 15Ile

ATC

19Asn 15Ile

TAG

19Asn 15Ile

TAA

19Asn 15Ile

ATT

19Asn 15Ile

4+

4+

−

+

+

++

#6 C → T initial subset

Y2 R2 *y2 → y2 r2 *R6 Y6 CCG

1Gly 9Thr

GGC

1Gly 9Thr

CCG

1Gly 9Thr

GCC

1Gly 9Thr

CGG

1Gly 9Thr

TGG

1Gly 9Thr

ACC

1Gly 9Thr

4+

4+

+

++

+

+

#17G → A

Y6 R6 *y6 → y6 r6 *R17 Y17CCA

7Ser 9Thr

GGT

7Ser 9Thr

CCA

7Ser 9Thr

ACC

7Ser 9Thr

TGG

7Ser 9Thr

TGA

7Ser 9Thr

ACT

7Ser 9Thr

4+

4+

−

+

+

++

#16G → A

Y7 R7 * R16 Y16CGC

9Thr 10Arg

GCG

9Thr 10Arg

GCA

9Thr 10Arg

CGT

9Thr 10Arg

+

+

++

#18G → A

Y16R16*y16→ y16r16*R18 Y18TGC

9Thr 11Cys

ACG

9Thr 11Cys

TGC

9Thr 11Cys

CGT

9Thr 11Cys

GCA

9Thr 11Cys

ACA

9Thr 11Cys

TGT

9Thr 11Cys

4+

+

4+

++

+

+

#7 G → C

Y1 R1 * R7 Y7CCC

2Ala 10Arg

GGG

2Ala 10Arg

GCG

2Ala 10Arg

CGC

2Ala 10Arg

+

4+

+

#9 G → A

Y7 R7 * R9 Y9CGC

2Ala 11Cys

GCG

2Ala 11Cys

ACG

2Ala 11Cys

TGC

2Ala 11Cys

++

+

+

#22G → A

Y19 R19 * R22 Y22CAC

16Met 13His

GTG

16Met 13His

GTA

16Met 13His

CAT

16Met 13His

+

+

++

#29G → A

Y24R24*y24→ y24r24*R29 Y29CAT

(Met,stop)15Ile 18Tyr

GTA


CAT


TAC


ATG


ATA


TAT


4+

−

4+

+

+

++

#19C → T

Y7 R7 * R19 Y19CGC

5Val 13His

GCG

5Val 13His

GTG

5Val 13His

CAC

5Val 13His

+

++

+

#24G → A

Y19 R19 * R24 Y24CAC

5Val 18Tyr

GTG

5Val 18Tyr

ATG

5Val 18Tyr

TAC

5Val 18Tyr

++

+

+

#20G → A

Y10R10*y10→ y10r10*R20 Y20GCC

14Gln (Thr,Ser)8Leu

CGG

14Gln (Thr,Ser)8Leu

GCC

14Gln (Thr,Ser)8Leu

CCG

14Gln (Thr,Ser)8Leu

GGC

14Gln (Thr,Ser)8Leu

GAC

14Gln (Thr,Ser)8Leu

CTG

14Gln (Thr,Ser)8Leu

+

4+

4+

+

++

+

#28G → A

Y20R20*y20→ y20r20*R28 Y28GTC

14Gln 8Leu

CAG

14Gln 8Leu

GTC

14Gln 8Leu

CTG

14Gln 8Leu

GAC

14Gln 8Leu

AAC

14Gln 8Leu

TTG

14Gln 8Leu

+

4+

4+

++

+

+

#10G → C

Y1 R1 * R10 Y10CCC

(stop)10Arg 6Pro

GGG

(stop)10Arg 6Pro

GGC

(stop)10Arg 6Pro

CCG

(stop)10Arg 6Pro

+

+

4+

#13G → A

Y10R10*y10→ y10r10*R13 Y13GCC

10Arg 7Ser

CGG

10Arg 7Ser

GCC

10Arg 7Ser

CCG

10Arg 7Ser

GGC

10Arg 7Ser

AGC

10Arg 7Ser

TCG

10Arg 7Ser

+

4+

4+

++

+

+

#25G → A

Y12R12*y12→ y12r12*R25 Y25ACC

(Gln)stop (Thr)8Leu

TGG

(Gln)stop (Thr)8Leu

ACC

(Gln)stop (Thr)8Leu

CCA

(Gln)stop (Thr)8Leu

GGT

(Gln)stop (Thr)8Leu

GAT

(Gln)stop (Thr)8Leu

CTA

(Gln)stop (Thr)8Leu

−

4+

4+

+

++

+

#31G → A

Y25R25*y25→ y25r25*R31 Y31ATC

(Gln)stop 8Leu

TAG

(Gln)stop 8Leu

ATC

(Gln)stop 8Leu

CTA

(Gln)stop 8Leu

GAT

(Gln)stop 8Leu

AAT

(Gln)stop 8Leu

TTA

(Gln)stop 8Leu

−

4+

4+

++

+

+

#12C → T

Y10R10*y10→ y10r10*R12 Y12GCC

12Trp 6Pro

CGG

12Trp 6Pro

GCC

12Trp 6Pro

CCG

12Trp 6Pro

GGC

12Trp 6Pro

GGT

12Trp 6Pro

CCA

12Trp 6Pro

+

4+

4+

+

+

++

#15G → A

Y12R12*y12→ y12r12*R15 Y15ACC

(Trp,Cys,Sel)stop 7Ser

TGG


ACC


CCA


GGT


AGT


TCA


−

4+

4+

++

+

+

Pro

Gly

Arg(Ser)

Glu

Leu(T

hr)

Ala

Gly

Val

Thr

Arg

Ala

Thr

Val

His

Arg

Pro

Gln,Leu

(Trp)

Phe

Tyr

Ile

Leu(Gln)

Gly

Pro

Hierarchy 1 Hierarchy 2 Hierarchy 3 Hierarchy 4

Ro

ute

0

Ro

ute

1

Ro

ute

2

Ro

ute

3

Ser

Asp

Ser

Cys

Leu(G

ln,T

hr)

Ser

Lys

Asn

(Met)

Leu(G

ln)

Fig 1A The roadmap for the evolution of the genetic code. The 64 codons formed from base substitutions in triplexDNAs are in red. Only three-base-length segments of the triplex DNAs are shown explicitly; the whole lengthright-handed triplex DNAs are indicated in Fig 1B. In each position #n, the #n codon pair on Rn and Yn is in red.The relative stabilities of the base triplexes (-, +, ++, 4+) are written to the right of the base triplexes, where theincreasing relative stabilities of base triplexes in base substitutions are indicated in green. Each triplex DNA is denotedby three arrows, whose directions are from 5’ to 3’. The YR ∗ R triplex DNAs are in pink, and the YR ∗ Y triplex DNAsin azure. The recruitment order of codon pairs are from #1 to #32, and the recruitment order of the 20 amino acids areto the left of them respectively. Non-standard genetic codes are indicated by brackets beside the corresponding aminoacids. The Route 0 − 3 and Hierarchy 1 − 4 are indicated to the right of and below the roadmap respectively. Theevolution of the genetic code are denoted by black arrows, beside which pair connections are indicated by certain aminoacids. Refer to an example in Fig 1B to understand details of the roadmap; refer to Fig S1a1 to understand the criticalrole of relative stabilities of base triplexes in achieving the real genetic code; refer to Fig 2A to see the origin of tRNAs;refer to Fig 2B to see the coherent relationship between the recruitment orders of codons and amino acids; refer to Fig2C to see the codon degeneracy in the symmetric roadmap; and refer to Fig 2D to see the origin of homochirality of life.

44


https://doi.org/10.1101/017988

#11CG*G → CG*A

CCC

GGG

GGA

+

+

++

#14CG*G → CG*A

CCT

GGA

AGA

+++

+

#1

CCC

GGG

GGG

+

+

+

#3 CG*G → CG*A

CCC

GGG

AGG

+++

+

#27CG*G → CG*A

CTC

GAG

GAA

+

+

++

#32CG*G → CG*A

CTT

GAA

AAA

+++

+

#4 CG*G → CG*A

CCC

GGG

GAG

+

+++

#23CG*G → CG*A

CTC

GAG

AAG

+++

+

#5 CG*G → CG*A

GCC

CGG

CAG

+

+++

#26CG*G → CG*A

GTC

CAG

CAA

+

+

++

#2 CG*G → CG*C

CCC

GGG

CGG

4++

+

#8 CG*G → CG*A

GCC

CGG

CGA

+

+

++

#21CG*G → CG*A

ACC

TGG

TAG

+

+++

#30CG*G → CG*A

ATC

TAG

TAA

+

+

++

#6 GC*C → GC*T

GCC

CGG

TGG

+++

+

#17CG*G → CG*A

ACC

TGG

TGA

+

+

++

#16CG*G → CG*A

CGC

GCG

GCA

+

+

++

#18CG*G → CG*A

CGT

GCA

ACA

+++

+

#7 CG*G → CG*C

CCC

GGG

GCG

+

4++

#9 CG*G → CG*A

CGC

GCG

ACG

+++

+

#22CG*G → CG*A

CAC

GTG

GTA

+

+

++

#29CG*G → CG*A

TAC

ATG

ATA

+

+

++

#19GC*C → GC*T

CGC

GCG

GTG

+

+++

#24CG*G → CG*A

CAC

GTG

ATG

+++

+

#20CG*G → CG*A

CCG

GGC

GAC

+

+++

#28CG*G → CG*A

CTG

GAC

AAC

+++

+

#10CG*G → CG*C

CCC

GGG

GGC

+

+

4+

#13CG*G → CG*A

CCG

GGC

AGC

+++

+

#25CG*G → CG*A

CCA

GGT

GAT

+

+++

#31CG*G → CG*A

CTA

GAT

AAT

+++

+

#12GC*C → GC*T

CCG

GGC

GGT

+

+

++

#15CG*G → CG*A

CCA

GGT

AGT

+++

+


Ro

ute

0

Ro

ute

1

Ro

ute

2

Ro

ute

3

CCC

GGG

GGG

+

+

+

CCT

GGA

GGA

+ +

+

CCC

GGG

GGG

+

+

+

CCC

GGG

GGG

+ +

+

CTC

GAG

GAG

+

+

+

CTT

GAA

GAA

+ +

+

CCC

GGG

GGG

+

+ +

CTC

GAG

GAG

+ +

+

GCC

CGG

CGG

+

+ +

GTC

CAG

CAG

+

+

+

CCC

GGG

GGG

+ +

+

GCC

CGG

CGG

+

+

+

ACC

TGG

TGG

+

+ +

ATC

TAG

TAG

+

+

+

GCC

CGG

CGG

+ +

+

ACC

TGG

TGG

+

+

+

CGC

GCG

GCG

+

+

+

CGT

GCA

GCA

+ +

+

CCC

GGG

GGG

+

+ +

CGC

GCG

GCG

+ +

+

CAC

GTG

GTG

+

+

+

TAC

ATG

ATG

+

+

+

CGC

GCG

GCG

+

+ +

CAC

GTG

GTG

+ +

+

CCG

GGC

GGC

+

+ +

CTG

GAC

GAC

+ +

+

CCC

GGG

GGG

+

+

+

CCG

GGC

GGC

+ +

+

CCA

GGT

GGT

+

+ +

CTA

GAT

GAT

+ +

+

CCG

GGC

GGC

+

+

+

CCA

GGT

GGT

+ +

+


Ro

ute

0

Ro

ute

1

Ro

ute

2

Ro

ute

3

Fig S1a1 Explanation of the increasing relative stabilities of base triplexes in base substitutions along theroadmap Fig 1A. The base substitutions in the roadmap occur when the relative stabilities of base triplexesincrease. The roadmap is the best result to avoid the unstable base triplexes. So, the universal genetic codeis a narrow choice by the relative stabilities of base triplexes. The relative stability increases from (+) of thebase triplex CG ∗G to (4+) of the base triplex CG ∗C at #2, #7 and #10 that initiates Route 1− 3 respectively.GC ∗ C (+) changes to GC ∗ T (++) at #6, #19 and #12, and CG ∗ G (+) changes to CG ∗ A (++) at otherpositions in the roadmap.

45


https://doi.org/10.1101/017988

→

→

→

→

→

→

→

→

→

→

→

→

→

→

→

→

→

→

→

→

→

→

→

→

#1 #7 #19 #24 #29G → C C → T G → A G → A

step by step

(The stability of basetriplex increases from




CG*G + to CG*C 4+ ) GC*C + to GC*T ++ ) CG*G + to CG*A ++ ) CG*G + to GC*A ++ )

R

GGG

combine

combine

seperate

re−combine

Y R

CCC

GGG

Y1 R1

CCC

GGG

Y24R24

CAT

GTA

y24 r24

TAC

ATG

Y7 R7

CGC

GCG

Y19R19

CAC

GTG

Y R *R1

CCC

GGG

GGG

+ + +

Y1 R1 *R7

CCC

GGG

GCG

+ 4++

Y24R24*y24

CAT

GTA

CAT

4+− 4+

y24 r24 *R29

TAC

ATG

ATA

+ + ++

Y7 R7 *R19

CGC

GCG

GTG

+ +++

Y19R19*R24

CAC

GTG

ATG

+++ +

Y R R1

andCCC

GGG

GGG

Y1 R1 R7

andCCC

GGG

GCG

Y24 R24 y24

andCAT

GTA

CAT

y24 r24 R29

andTAC

ATG

ATA

Y7 R7 R19

andCGC

GCG

GTG

Y19 R19 R24

andCAC

GTG

ATG

R1 Y1

GGG

CCC

R7 Y7

GCG

CGC

y24r24

CAT

GTA

R29 Y29

ATA

TAT

R19 Y19

GTG

CAC

R24 Y24

ATG

TAC

R

5’

3’

G

G

G

G

G

G

G

G

G

G

G

G

G

Y

3’

5’

C

C

C

C

C

C

C

C

CCC

G GG

G

G

G5’

3’ 5’

3’5’3’

Y R R1

CCC

G GG

5’

3’

3’5’Y R

CCC

G GG

G

G

C5’

3’ 5’

3’5’3’

Y1 R1 R7

CCC

G GG

5’

3’

3’5’Y1 R1

CGC

G CG

G

G

T5’

3’ 5’

3’5’3’

Y7 R7 R19

CGC

G CG

5’

3’

3’5’Y7 R7

CAC

G TG

G

A

T5’

3’ 5’

3’5’3’

Y19R19R24

CAC

G TG

5’

3’

3’5’Y19R19

TAC

A TG

T

C

A5’

3’ 3’

3’5’5’

Y24R24y24

TAC

A TG

5’

3’

3’5’Y24R24

CAT

G TA

A

A

T5’

3’ 5’

3’5’3’

y24r24R29

CAT

G TA

5’

3’

3’5’y24r24

Fig 1B A detailed description of the roadmap in the right-handed triplex DNA picture, taking for examplefrom #1 to #29. The evolution of the genetic code from #1, to #7, to #19, to #24, and at last to #29 areexplained in detail in the upper boxes, and the corresponding right-handed single-stranded, double-strandedand triple-stranded DNAs are shown in the lower boxes, respectively.

46


https://doi.org/10.1101/017988

#11

tRNA t7 tRNA t10

Y11R11*Yt11 R

t11 y11r11*y

t11 r

t11

7Ser 10Arg

TCC

AGG

UCC

AGG

CCT

GGA

CCU

GGA

#14

#1

tRNA t1

Y1 R1 * Yt1 R

t1

1Gly

CCC

GGG

CCC

GGG

#3 #27

tRNA t20 tRNA t17

Y23R23*Yt23 R

t23 y23r23*y

t23 r

t23

20Lys 17Phe

CTT

GAA

CUU

GAA

TTC

AAG

UUC

AAG

#32

#4

tRNA t3

Y4 R4 * Yt4 R

t4

3Glu

CTC

GAG

CUC

GAG

#23

#5

tRNA t4

Y5 R5 *Yt5 R

t5 y5 r5 *y

t5 r

t5

4Asp

CTG

GAC

CUG

GAC

GTC

CAG

GUC

CAG

#26

#2

Y2 R2 *Yt2 R

t2 y2 r2 *y

t2 r

t2

CCG

GGC

CCG

GGC

GCC

CGG

GCC

CGG

#8 #21

tRNA t19

Y21R21*Yt21 R

t21 y21r21*Y

t30 R

t30

19Asn

CTA

GAT

CUA

GAU

ATC

TAG

AUU

UAA

#30

#6

Y6 R6 *Yt6 R

t6 y6 r6 *y

t6 r

t6

CCA

GGT

CCA

GGU

ACC

TGG

ACC

UGG

#17

#16

tRNA t11 tRNA t9

Y16R16*Yt16 R

t16 y16r16*y

t16 r

t16

11Cys 9Thr

TGC

ACG

UGC

ACG

CGT

GCA

CGU

GCA

#18

#7

tRNA t2

Y7 R7 * Yt7 R

t7

2Ala

CGC

GCG

CGC

GCG

#9 #22

tRNA t16 tRNA t18

Y24R24*Yt24 R

t24 y24r24*y

t24 r

t24

16Met 18Tyr

CAT

GTA

CAU

GUA

TAC

ATG

UAC

AUG

#29

#19

tRNA t13

Y19 R19 * Yt19 R

t19

13His

CAC

GTG

CAC

GUG

#24

#20

tRNA t5 tRNA t14

Y20R20*Yt20 R

t20 y20r20*y

t20 r

t20

5Val 14Gln

GTC

CAG

GUC

CAG

CTG

GAC

CUG

GAC

#28

#10

tRNA t6

Y10R10*Yt10 R

t10 y10r10*y

t10 r

t10

6Pro

GCC

CGG

GCC

CGG

CCG

GGC

CCG

GGC

#13 #25

tRNA t15 tRNA t8

Y25R25*Yt25 R

t25 y25r25*y

t25 r

t25

15Ile 8Leu

ATC

TAG

AUC

UAG

CTA

GAT

CUA

GAU

#31

#12

tRNA t12

Y12R12*Yt12 R

t12 y12r12*y

t12 r

t12

12Trp

ACC

TGG

ACC

UGG

CCA

GGT

CCA

GGU

#15

Wobble pairing G:U,C

Wobble pairing U:G,A


Ro

ute

0

Ro

ute

1

Ro

ute

2

Ro

ute

3

Fig 2A The tRNAs originated along the roadmap are able to transport all the canonical amino acids. Eachtriplex nucleic acid YR ∗ Yt or YR ∗ Rt consists of two DNA strands Y and R and one RNA strand Yt or Rt.There are two possible ways for each codon pair to generate two complementary RNA strands (Yt and Rt),which consequently assemble into a tRNA (Fig S2a3).

47


https://doi.org/10.1101/017988

CCC

CUC UCC CCU

CGC GCC CCG

CAC ACC CCA

GUC CUG

UGC CGU

AUC(AUU) CUA

UAC CAU

CUU UUC

GGG

GAG GGA AGG

GCG GGC CGG

GUG GGU UGG

GAC CAG

GCA ACG

GAU UAG

GUA AUG

AAG GAA

t1

t3 t10

t2

t12

t4 t14

t9

t19

t16

t20

t7

t6

t13

t5

t11

t15 t8

t18

t17

Palindromic anti−codon Non−palindromic anti−codon

Y s

tran

dR

stra

nd

Stra

nd

po

sitio

nFig S2a1 Single-strand RNA pairs at the branch nodes in the roadmap Fig 1A can form tRNAs: the tRNAs t1,t10, t3, t17, t2, t9, t13, t18 correspond to the codon pairs in Route 0 and Route 2, and the tRNAs t7, t20, t11,t16 correspond to the reverse codon pairs (considering the reverse operation) in Route 0 and Route 2 (Fig 2A).The codon pairs are reverse to each other between Route 1 and Route 3. The tRNAs t4, t19, t6, t5, t14, t12,t15, t8 correspond to codon pairs or reverse codon pairs in Route 1 and Route 3 (Fig 2A). In total, there are20 tRNAs such obtained that correspond to the codon pairs or their reverse codon pairs at the branch nodes.The anti-codons concerned are arranged in the present figure by comparing between Y strand anti-codonsand R strand anti-codons, as well as by comparing between palindromic anti-codons and non-palindromicanti-codons, and the rows in the present figure indicate the types of genomic codon distributions in Fig S3d1.

48


https://doi.org/10.1101/017988

GGG 1Gly t1

GGA 1Gly

GGC 1Gly

GGU 1Gly

GCG 2Alat2

GCA 2Ala

GCC 2Ala

GCU 2Ala

GAG 3Glu t3

GAA 3Glu

GAC 4Asp t4

GAU 4Asp

GUG 5Val

GUA 5Val

GUC 5Valt5

GUU 5Val

CGG10Arg

CGA10Arg

CGC10Arg

CGU10Arg

CCG 6Prot6

CCA 6Pro

CCC 6Pro

CCU 6Pro

CAG14Gln t14

CAA14Gln

CAC13His t13

CAU13His

CUG 8Leu

CUA 8Leut8

CUC 8Leu

CUU 8Leu

AGG10Arg t10

AGA10Arg

AGC 7Ser

AGU 7Ser

ACG 9Thrt9

ACA 9Thr

ACC 9Thr

ACU 9Thr

AAG20Lys t20

AAA20Lys

AAC19Asn

AAU19Asnt19

AUG 16Mett16

AUA 15Ile

AUC 15Ilet15

AUU 15Ile

UGG12Trp t12

UGA stop

UGC11Cys t11

UGU11Cys

UCG 7Ser

UCA 7Ser

UCC 7Sert7

UCU 7Ser

UAG stop

UAA stop

UAC18Tyr t18

UAU18Tyr

UUG 8Leu

UUA 8Leu

UUC 17Phet17

UUU 17Phe

Fig S2a2 The 20 tRNAs t1-t20 (Fig 2A, S2a1) are able to carry all the 20 canonical amino acids according tothe proper codon assignment in the present figure, where the blue and red codons denote the R- and Y-strandcodons in Fig S2a1 respectively, and the green lines indicate the codon pairs.

49


https://doi.org/10.1101/017988

1Gly t1 Y

t1 R

t1 CCC GGG

5’ 3’

2Ala t2 Y

t7 R

t7 CGC GCG

5’ 3’

3Glu t3 Y

t4 R

t4 CUC GAG

5’ 3’

4Asp t4 y

t5 r

t5 GUC GAC

5’ 3’

5Val t5 R

t20 Y

t20GAC GUC

5’ 3’

6Pro t6 rt10 y

t10CGG CCG

5’ 3’

7Ser t7 R

t11 Y

t11GGA UCC

5’ 3’

8Leu t8 rt25 y

t25UAG CUA

5’ 3’

9Thr t9 y

t16 r

t16CGU ACG

5’ 3’

10Arg t10y

t11 r

t11CCU AGG

5’ 3’

11Cys t11R

t16 Y

t16GCA UGC

5’ 3’

12Trp t12y

t12 r

t12CCA UGG

5’ 3’

13His t13R

t19 Y

t19GUG CAC

5’ 3’

14Gln t14y

t20 r

t20CUG CAG

5’ 3’

15Ile t15R

t25 Y

t25GAU AUC

5’ 3’

16Met t16Y

t24 R

t24CAU AUG

5’ 3’

17Phe t17rt23 y

t23GAA UUC

5’ 3’

18Tyr t18rt24 y

t24GUA UAC

5’ 3’

19Asn t19Y

t30 R

t30AUU AAU

5’ 3’

20Lys t20Y

t23 R

t23CUU AAG

5’ 3’

anti−codon

5’

3’

3’

5’

an

ti−co

do

n

5’

3’

3’

5’

an

ti−co

do

n

5’ 3’

anti−codon

Fig S2a3 The tRNAs t1-t20 with anti-codons (Fig 2A, S2a1, S2a2) are listed here to carry the amino acidsfrom No.1 to No.20 respectively. The two complimentary single-stranded RNAs for each tRNA join together,and fold into a cloverleaf shape by taking advantage of the complementarity between the two strands. Thejoining position of the two strands is near to the 3’ side of the anti-codon loop, which agrees with the positionof introns in tRNA genes in observations. The additional codon in the complementary strand might be locatedaround the variable loop, despite possibly great changes in this loop. This may relate to the paracodon in thetRNAs.

50


https://doi.org/10.1101/017988

#1110Arg

6Pro

#1410Arg

7Ser

#1 1Gly

6Pro

#3 1Gly

7Ser

#2720Lys

8Leu

#3220Lys

17Phe

#4 3Glu

8Leu

#23 3Glu

17Phe

#5 4Asp

5Val

#2619Asn

5Val

#2 1Gly

2Ala

#8 7Ser

2Ala

#21 4Asp

15Ile

#3019Asn

15Ile

#6 1Gly

9Thr

#17 7Ser

9Thr

#16 9Thr

10Arg

#18 9Thr

11Cys

#7 2Ala

10Arg

#9 2Ala

11Cys

#2216Met

13His

#2915Ile

18Tyr

#19 5Val

13His

#24 5Val

18Tyr

#2014Gln

8Leu

#2814Gln

8Leu

#1010Arg

6Pro

#1310Arg

7Ser

#25 stop

8Leu

#31 stop

8Leu

#1212Trp

6Pro

#15 stop

7Ser

Ala Pro Ser Thr Val Leu

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32

Initial subset


Ro

ute

0

Ro

ute

1

Ro

ute

2

Ro

ute

3

Fig 2B Cooperative recruitment of codons and amino acids along the roadmap. The codon pairs are plottedfrom left to right according to their recruitment order. The initial subset plays a crucial role in the expansionof the genetic code. The 6 biosynthetic families of the amino acid are distinguished by different colours.

51


https://doi.org/10.1101/017988

GGG GCC GAG GAC GUC CCC UCC CUC ACC CGC UGC UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#1

GGC GCC GAG GAC GUC CCG UCC CUC ACC CGC UGC UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#2

GGA GCG GAG GAC GUC CCG UCC CUC ACC CGC UGC UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#3

GGU GCG GAG GAC GUC CCG AGC CUC ACC CGC UGC UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#4

GGU GCG GAA GAC GUC CCG AGC CUG ACC CGC UGC UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#5

GGU GCG GAA GAU GUG CCG AGC CUG ACC CGC UGC UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#6

GGU GCG GAA GAU GUG CCG AGC CUG ACG CGC UGC UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#7

GGU GCU GAA GAU GUG CCG AGC CUG ACG CGG UGC UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#8

GGU GCA GAA GAU GUG CCG UCG CUG ACG CGG UGC UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#9

GGU GCA GAA GAU GUG CCG UCG CUG ACG CGG UGU UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#10

GGU GCA GAA GAU GUG CCU UCG CUG ACG AGG UGU UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#11

GGU GCA GAA GAU GUG CCA UCG CUG ACG CGA UGU UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#12

GGU GCA GAA GAU GUG CCA UCG CUG ACG CGA UGU UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#13

GGU GCA GAA GAU GUG CCA UCU CUG ACG AGA UGU UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#14

GGU GCA GAA GAU GUG CCA UCA CUG ACG CGU UGU UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#15

GGU GCA GAA GAU GUG CCA AGU CUG ACG CGU UGU UGG CAC CAG AUC AUG UUC UAC AAC AAG UAG#16

GGU GCA GAA GAU GUG CCA AGU CUG ACU CGU UGU UGG CAC CAG AUC AUG UUC UAC AAC AAG UAG#17

GGU GCA GAA GAU GUG CCA AGU CUG ACA CGU UGU UGG CAC CAG AUC AUG UUC UAC AAC AAG UAG#18

GGU GCA GAA GAU GUG CCA AGU CUG ACA CGU UGU UGG CAC CAG AUC AUG UUC UAC AAC AAG UAG#19

GGU GCA GAA GAU GUA CCA AGU CUG ACA CGU UGU UGG CAU CAG AUC AUG UUC UAC AAC AAG UAG#20

GGU GCA GAA GAU GUA CCA AGU CUA ACA CGU UGU UGG CAU CAA AUC AUG UUC UAC AAC AAG UAG#21

GGU GCA GAA GAU GUA CCA AGU CUA ACA CGU UGU UGG CAU CAA AUA AUG UUC UAC AAC AAG UAG#22

GGU GCA GAA GAU GUA CCA AGU CUA ACA CGU UGU UGG CAU CAA AUA AUG UUC UAC AAC AAG UAG#23

GGU GCA GAA GAU GUA CCA AGU CUA ACA CGU UGU UGG CAU CAA AUA AUG UUU UAC AAC AAG UAG#24

GGU GCA GAA GAU GUU CCA AGU CUA ACA CGU UGU UGG CAU CAA AUA AUG UUU UAU AAC AAG UAG#25

GGU GCA GAA GAU GUU CCA AGU CUU ACA CGU UGU UGG CAU CAA AUA AUG UUU UAU AAC AAG UAA#26

GGU GCA GAA GAU GUU CCA AGU CUU ACA CGU UGU UGG CAU CAA AUA AUG UUU UAU AAU AAG UAA#27

GGU GCA GAA GAU GUU CCA AGU UUG ACA CGU UGU UGG CAU CAA AUA AUG UUU UAU AAU AAA UAA#28

GGU GCA GAA GAU GUU CCA AGU UUA ACA CGU UGU UGG CAU CAA AUA AUG UUU UAU AAU AAA UAA#29

GGU GCA GAA GAU GUU CCA AGU UUA ACA CGU UGU UGG CAU CAA AUU AUG UUU UAU AAU AAA UAA#30



1Gly 2Ala 3Glu 4Asp 5Val 6Pro 7Ser 8Leu 9Thr 10Arg 11Cys 12Trp 13His 14Gln 15Ile 16Met 17Phe 18Tyr 19Asn 20Lys stop

Fig S2b1 The recruitment orders of the codon pairs and the amino acids in the roadmap Fig 1A are listedin the left column and top row respectively. The codons that encode each amino acid are listed in columnsrespectively, where the row positions for the degenerate codons are determined by the corresponding codonsin the first column of codon pairs from #1 to #32 (blue codons), and the other vacant positions in eachcolumn are filled by the nearby degenerate codons respectively. These blue codons are aligned from up-leftto down-right, which shows that the later the codon pairs recruited, the later the corresponding amino acidsrecruited.

52


https://doi.org/10.1101/017988

0.3 0.4 0.5 0.60

0.2

0.4

0.6

0.8

1

GC content

co

do

n p

ositio

n−

sp

ecific

GC

co

nte

nt

1st codon position

2nd codon position

3rd codon position

Fig S2b2 It is observed that the GC content in the third codon position decreases more rapidly than thatin the first and second base positions when the total GC content decreases, and the GC content in the firstcodon position is generally greater than that in the second codon position. Such a pattern of variation of thebase position specific GC contents in observations can be explained by the roadmap. According to the ordersof degenerate codons for each amino acid in Fig S2b1, the total GC content and the GC contents at 1st-,2nd-, 3rd-codon-position at stages from #1 to #32 can be obtained by counting and calculating in each rowof Fig S2b1, respective. Consequently, the relationships between the total GC content and the base positionspecific GC contents are obtained respectively. Even some detailed features in observation are reproducedin the present figure based on the roadmap Fig 1A, such as the closer distance between 1st and 2nd codonposition GC content at low total GC content side. Such an explanation indicates that the recruitment orderof the codons from #1 to #32 obtained based on the roadmap Fig 1A is considerably reasonable, and henceconfirms the validity of the roadmap theory.

53


https://doi.org/10.1101/017988

0.3 0.4 0.5 0.60

0.2

0.4

0.6

0.8

1

GC content

fre

qu

en

cie

s o

f sto

p c

od

on

s

UGA

UAG

UAA

Fig S2b3 It is observed that the stop codons vary in certain pattern when the total GC content decreases basedon genomic data. Namely, along the evolutionary direction of the decreasing total GC content, the first stopcodon UGA in the roadmap decrease greatly, the second codon UAG in the roadmap decrease slightly, and thethird stop codon UAA increase greatly. Such a pattern in observations can be explained by the roadmap. Therelationships between the total GC content and the frequencies of the three stop codons are obtained basedon the roadmap Fig 1A, respectively, by counting and calculating similarly in Fig S2b2, where the variationranges for the three stop codons have been adjusted according to the observations. A detailed feature inobservations, namely the respective downward and upward leaps of UGA and UAA at the middle total GCcontent, is reproduced in the present figure based on the counts in Fig S2b1. Such an explanation indicatesthat the recruitment order of the three stop codons obtained based on the roadmap Fig 1A is considerablyreasonable, and hence confirms the validity of the roadmap theory.

54


https://doi.org/10.1101/017988

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

variation o

f th

e a

min

o a

cid

fre

quencie

s for

the s

pecie

s in R

10/1

0 e

volu

tionary

ord

er

Gly Ala Glu Asp Val Pro Ser Leu Thr Arg Cys Trp His Gln Ile Met Phe Tyr Asn Lys

Fig S2b4 Explanation of the variation of the amino acid frequencies. The 20 amino acids are arranged inthe recruitment order in the roadmap Fig 1A. The 20 amino acid frequencies for each of the 803 species areobtained respectively based on the genomic data in NCBI. And the 803 amino acid frequencies (green dots)for each of the 20 amino acids are all arranged properly in the R10/10 order, respectively. The variation trendof the amino acid frequencies for each of the 20 amino acids is obtained by the regression line (denoted inred). Generally speaking, the variation trends for the earlier amino acids tend to decrease, and the variationtrends for the latecomers to increase (Fig S2b4).

55


https://doi.org/10.1101/017988

−12

−6

0

6

12

slo

pe o

f th

e v

ariation o

f am

ino a

cid

fre

quencie

s (

10

−5)

Gly Ala Glu Asp Val Pro Ser Leu Thr Arg Cys Trp His Gln Ile Met Phe Tyr Asn Lys

Fig S2b5 Increasing variation rates of the amino acid frequencies. The slope of the regression line of variationof amino acid frequencies for each amino acid is obtained based on the data in Fig S2b4. It is observed inthe present figure that the slopes vary from negative to positive in general according to the recruitment orderof amino acids in the roadmap. Such a natural result indicates that the recruitment order of the amino acidsfrom No.1 to No.20 obtained based on the roadmap Fig 1A is considerably reasonable, and hence confirmsthe validity of the roadmap theory.

56


https://doi.org/10.1101/017988

G

T

G

C

A

C

#19

Val His

IA/m IIA/M<G> <C>

N_branch

G

T

A

C

A

T

#22

Met His

IA/m IIA/M<A> <T>

N_leaf

G

C

G

C

G

C

#7

Ala Arg

IIA/M IA/m<G> <C>

N_branch

G

C

A

C

G

T

#16

Thr Arg

IIA/M IA/m<G> <C>

N_branch

A

T

G

T

A

C

#24

Val Tyr

IA/m IC/M<A> <T>

N_branch

A

T

A

T

A

T

#29

Ile TyrMet,stp

IA/m IC/M<A> <T>

N_leaf

A

C

G

T

G

C

#9

Ala Cys

IIA/M IA/m<G> <C>

N_leaf

A

C

A

T

G

T

#18

Thr Cys

IIA/M IA/m<A> <T>

N_leaf

G

A

G

C

T

C

#4

Glu Leu Thr

IB/m IA/m<G> <C>

N_branch

G

A

A

C

T

T

#27

Lys Leu Thr

IIB/m IA/m<A> <T>

N_leaf

G

G

G

C

C

C

#1

Gly Pro

IIA/M IIA/M<G> <C>

N_branch

G

G

A

C

C

T

#11

Arg Prostp,Ser

IA/M IIA/M<G> <C>

N_branch

A

A

G

T

T

C

#23

Glu Phe

IB/m IIC/m<A> <T>

N_branch

A

A

A

T

T

T

#32

Lys Phe Asn

IIB/m IIC/m<A> <T>

N_leaf

A

G

G

T

C

C

#3

Gly Ser

IIA/M IIA/M<G> <C>

N_leaf

A

G

A

T

C

T

#14

Arg Serstp,Ser

IA/M IIA/M<A> <T>

N_leaf

G

G

T

C

C

A

#12

Trp Pro

IC/m IIA/M<G> <C>

N_branch

G

A

T

C

T

A

#25

stp LeuGln,Pyl Thr

IA/m<A> <T>

N_branch

G

G

C

C

C

G

#10

Arg Pro stp

IA/m IIA/M<G> <C>

N_branch

G

A

C

C

T

G

#20

Gln Leu Thr,Ser

IB/m IA/m<G> <C>

N_branch

A

G

T

T

C

A

#15

stp SerTrp,Cys

IIA/M<A> <T>

N_leaf

A

A

T

T

T

A

#31

stp Leu Gln

IA/m<A> <T>

N_leaf

A

G

C

T

C

G

#13

Arg Ser

IA/m IIA/M<G> <C>

N_leaf

A

A

C

T

T

G

#28

Gln Leu

IB/m IA/m<A> <T>

N_leaf

T

G

G

A

C

C

#6

Gly Thr

IIA/M IIA/M<G> <C>

N_branch

T

A

G

A

T

C

#21

Asp Ile

IIB/M IA/m<A> <T>

N_branch

C

G

G

G

C

C

#2

Gly Ala

IIA/M IIA/M<G> <C>

N_branch

C

A

G

G

T

C

#5

Asp Val

IIB/M IA/m<G> <C>

N_branch

T

G

A

A

C

T

#17

Ser Thr

IIA/M IIA/M<A> <T>

N_leaf

T

A

A

A

T

T

#30

Asn Ile

IIB/M IA/m<A> <T>

N_leaf

C

G

A

G

C

T

#8

Ser Ala

IIA/M IIA/M<G> <C>

N_leaf

C

A

A

G

T

T

#26

Asn Val

IIB/M IA/m<A> <T>

N_leaf

#19−Val−#24

#22−Met−#29

#7−Ala−#9

#16−Thr−#18

#19−His−#22

#7−Arg

−#16

#24−Tyr−#29

#9−Cys

−#18

#4−Glu−#23

#27−Lys−#32

#1−Gly−#3*

#11−Arg−#14

#4−Leu−#27

#1−Pro

−#11

#23−Phe−#32

#3−Ser−

#14

#1

0−

Pro

−#

12

#2

0−

Le

u−

#2

5

#1

3−

Se

r−#

15

#2

8−

Le

u−

#3

1

#12−Trp−#15

#25−stp−#31

#10−Arg−#13

#20−Gln−#28

#2

−G

ly−

#6

*

#5

−A

sp

−#

21

#8

−S

er−

#1

7

#2

6−

Asn

−#

30

#6−Thr−#17

#21−Ile−#30

#2−Ala−#8

#5−Val−#26

Route 2

Route 0

Route 3

Route 1

biosynthetic

families

Glu

Asp

Val

Ser

Phe

initial/stop

aaRS types

IIA,IIB,IIC

IA,IB,IC

M: major groove

m: minor groove

genomic codon

distribution types

<G>,<C>,<A>,<T>

pair connections

#1−Gly−#3 etc.

Colored ones indicate

route duality

node types

N_branch: branch node

N_leaf: leaf node

facets

w es

n

d

u

Fig 2C Explanation of the codon degeneracy based on the relationships among codons (pair connectionsand route dualities) in the roadmap. This is a revised plot of the roadmap Fig 1A to indicate the symmetryin the evolution of the genetic code, where the four routes are represented by four cubes respectively. Pairconnections are marked besides the evolutionary arrows in the roadmap. Route dualities are indicated bysame colours for the corresponding pair connections. The biosynthetic families of amino acids are denotedby coloured semicircles. The types of the aminoacyl-tRNA synthetases are besides the codons. Branch nodesand leaf nodes are distinguished. The 6 facets for each route are indicated on a cube at bottom right.

57


https://doi.org/10.1101/017988

G

G

G

A

C

U

C

C

G

A

C

U

A

A

G

A

C

U

U

U

G

A

C

U

GGG Route 0 R

GGA Route 0 R

GGC Route 1 R

GGU Route 1 R

GC G Route 2 R

GC A Route 2 R

GC C Route 1 Y

GC U Route 1 Y

GA G Route 0 R

GA A Route 0 R

GA C Route 1 R

GA U Route 1 R

GU G Route 2 R

GU A Route 2 R

GU C Route 1 Y

GU U Route 1 Y

C GG Route 3 R

C GA Route 3 R

C GC Route 2 Y

C GU Route 2 Y

C C G Route 3 Y

C C A Route 3 Y

C C C Route 0 Y

C C U Route 0 Y

C A G Route 3 R

C A A Route 3 R

C A C Route 2 Y

C A U Route 2 Y

C U G Route 3 Y

C U A Route 3 Y

C U C Route 0 Y

C U U Route 0 Y

A GG Route 0 R

A GA Route 0 R

A GC Route 1 R

A GU Route 1 R

A C G Route 2 R

A C A Route 2 R

A C C Route 1 Y

A C U Route 1 Y

A A G Route 0 R

A A A Route 0 R

A A C Route 1 R

A A U Route 1 R

A U G Route 2 R

A U A Route 2 R

A U C Route 1 Y

A U U Route 1 Y

U GG Route 3 R

U GA Route 3 R

U GC Route 2 Y

U GU Route 2 Y

U C G Route 3 Y

U C A Route 3 Y

U C C Route 0 Y

U C U Route 0 Y

U A G Route 3 R

U A A Route 3 R

U A C Route 2 Y

U A U Route 2 Y

U U G Route 3 Y

U U A Route 3 Y

U U C Route 0 Y

U U U Route 0 Y

Fig S2c1 The distribution of codons from R- and Y-strands of Route 0 − 3 in the GCAU genetic code table.The pattern of the 4×4 codon boxes for the degenerate codons relates to such a distribution of the four routes,due to the evolution of the genetic code along the roadmap.

58


https://doi.org/10.1101/017988

G

G

G

A

C

U

C

C

G

A

C

U

A

A

G

A

C

U

U

U

G

A

C

U

GGG#1 No. 1Gly

GGA#3 No. 1Gly

GGC#2 No. 1Gly

GGU#6 No. 1Gly

GCG#7 No. 2Ala

GCA#9 No. 2Ala

GCC#2 No. 2Ala

GCU#8 No. 2Ala

GAG#4 No. 3Glu

GA A#23 No. 3Glu

GAC#5 No. 4Asp

GA U#21 No. 4Asp

GUG#19 No. 5Val

GUA#24 No. 5Val

GUC#5 No. 5Val

GUU#26 No. 5Val

CGG#10 No.10Arg

CGA#13 No.10Arg

CGC#7 No.10Arg

CGU#16 No.10Arg

C CG#10 No. 6Pro

C CA#12 No. 6Pro

C CC#1 No. 6Pro

C CU#11 No. 6Pro

CAG#20 No.14Gln

CA A#28 No.14Gln

CAC#19 No.13His

CA U#22 No.13His

C UG#20 No. 8Leu

C UA#25 No. 8Leu

C UC#4 No. 8Leu

C UU#27 No. 8Leu

AGG#11 No.10Arg

A GA#14 No.10Arg

A GC#8 No. 7Ser

A GU#17 No. 7Ser

A CG#16 No. 9Thr

A CA#18 No. 9Thr

A CC#6 No. 9Thr

A CU#17 No. 9Thr

AAG#27 No.20Lys

A A A#32 No.20Lys

A A C#26 No.19Asn

AAU#30 No.19Asn

A UG#22 No.16Met

A UA#29 No.15Ile

A UC#21 No.15Ile

A UU#30 No.15Ile

UGG#12 No.12Trp

UGA#15 stop

UGC#9 No.11Cys

UGU#18 No.11Cys

U CG#13 No. 7Ser

U CA#15 No. 7Ser

U CC#3 No. 7Ser

U CU#14 No. 7Ser

UA G#25 stop

UA A#31 stop

UAC#24 No.18Tyr

UA U#29 No.18Tyr

U UG#28 No. 8Leu

U UA#31 No. 8Leu

U UC#23 No.17Phe

U UU#32 No.17Phe

biosynthetic families: Glu Asp Val Ser Phe others

Fig S2c2 The clusterings of biosynthetic families (Glu Asp Val Ser Phe) in the GCAU genetic code table.Such nice clusterings are correspondingly observed in the R- and Y-strands of Route 0− 3 in Fig 2C (denotedin the same group of colour as in the present figure). The clusterings of biosynthetic families in the presentfigure are closely related to the distribution of codons from R- and Y-strands of Route 0 − 3 in Fig S2c1,due to the recruitment of amino acids along the roadmap. Generally speaking, the amino acids are arrangedproperly in the recruitment order from No.1 to No.20 along the direction from G, C to A, U in the GCAUgenetic code table.

59


https://doi.org/10.1101/017988

G

G

G

A

C

U

C

C

G

A

C

U

A

A

G

A

C

U

U

U

G

A

C

U

GGG Gly IIA m

GGA Gly IIA m

GGC Gly IIA m

GGU Gly IIA m

GCG Ala IIA M

GCA Ala IIA M

GCC Ala IIA M

GCU Ala IIA M

GAG Glu IB m

GAA Glu IB m

GAC Asp IIB M

GAU Asp IIB M

GUG Val IA m

GUA Val IA m

GUC Val IA m

GUU Val IA m

CGG Arg IA m

CGA Arg IA m

CGC Arg IA m

CGU Arg IA m

CCG Pro IIA M

CCA Pro IIA M

CCC Pro IIA M

CCU Pro IIA M

CAG Gln IB m

CAA Gln IB m

CAC His IIA M

CAU His IIA M

CUG Leu IA m

CUA Leu IA m

CUC Leu IA m

CUU Leu IA m

AGG Arg IA m

AGA Arg IA m

AGC Ser IIA m

AGU Ser IIA m

ACG Thr IIA M

ACA Thr IIA M

ACC Thr IIA M

ACU Thr IIA M

AAG Lys IIB m

AAA Lys IIB m

AAC Asn IIB M

AAU Asn IIB M

AUG Met IA M

AUA Ile IA M

AUC Ile IA M

AUU Ile IA M

UGG Trp IC m

UGA stp m

UGC Cys IA m

UGU Cys IA m

UCG Ser IIA M

UCA Ser IIA M

UCC Ser IIA M

UCU Ser IIA M

UAG stp m

UAA stp m

UAC Tyr IC M

UAU Tyr IC M

UUG Leu IA M

UUA Leu IA M

UUC Phe IIC M

UUU Phe IIC M

Fig S2c3 The distribution of types of aminoacyl-tRNA synthetases in the GCAU genetic code table. Theaminoacyl-tRNA synthetases can be divided into Class II and Class I, which can be divided into subclassesIIA, IIB, IIC, and IA, IB, IC, respectively. And the aminoacyl-tRNA synthetases can also be divided intominor groove ones (m) and major groove ones (M).

60


https://doi.org/10.1101/017988

#11 #14

#27 #32

#23

#26

#8 #21 #30

#17

#16 #18

#7 #9 #22 #29

#19 #24

#20 #28

#10 #13 #25 #31

#12 #15

CCC

3’

5’

GGG

5’

3’

GGG

3’

5’

CCC

5’

3’

CCC

3’

5’

GGG

5’

3’

AGG

3’

5’

TCC

5’

3’

CCC

3’

5’

GGG

5’

3’

GAG

3’

5’

CTC

5’

3’

CCC

3’

5’

GGG

5’

3’

CGG

3’

5’

GCC

5’

3’

CCG

3’

5’

GGC

5’

3’

CCG

5’

3’

GCC

3’

5’

CGG

5’

3’

CAG

3’

5’

GTC

5’

3’

CCG

3’

5’

GGC

5’

3’

CCG

5’

3’

GCC

3’

5’

CGG

5’

3’

TGG

3’

5’

ACC

5’

3’

#1

#3

#4

#2

#5

#6

11#14#

27#32#

23#

26#

8#21#30#

17#

16#18#

7# 9#22#29#

19#24#

20#28#

10#13#25#31#

12#15#

CCC

3’

5’

GGG

5’

3’

GGG

3’

5’

CCC

5’

3’

CCC

3’

5’

GGG

5’

3’

AGG

3’

5’

TCC

5’

3’

CCC

3’

5’

GGG

5’

3’

GAG

3’

5’

CTC

5’

3’

CCC

3’

5’

GGG

5’

3’

CGG

3’

5’

GCC

5’

3’

CCG

3’

5’

GGC

5’

3’

CCG

5’

3’

GCC

3’

5’

CGG

5’

3’

CAG

3’

5’

GTC

5’

3’

CCG

3’

5’

GGC

5’

3’

CCG

5’

3’

GCC

3’

5’

CGG

5’

3’

TGG

3’

5’

ACC

5’

3’

1#

3#

4#

2#

5#

6#

Left−handed roadmap

left−handed triplex nucleic acids

D−amino acid, L−ribose

pyrimidine, purine

Right−handed roadmap

right−handed triplex nucleic acids

D−ribose, L−amino acid

purine, pyrimidine

→ ←

non−chirality

point

Fig 2D The homochirality of life originated in a winner-take-all game between the right-handed roadmap andthe left-handed roadmap. The right handed roadmap is just the roadmap in Fig 1A (only the initial subset areshown explicitly here), whose chirally symmetric partner roadmap, i. e., the left-handed roadmap, is plottedon the left. Ref to Fig S2d1 to see a heuristic model on the origin of the homochirality of life.

61


https://doi.org/10.1101/017988

#11#

#22#

#33#

#44#

#55#

#66#

#77#

#88#

#99#

#1010#

#1111#

#1212#

#1313#

#1414#

#1515#

#1616#

#1717#

#1818#

#1919#

#2020#

#2121#

#2222#

#2323#

#2424#

#2525#

#2626#

#2727#

#2828#

#2929#

#3030#

#3131#

#3232#

Right−handed roadmapLeft−handed roadmap

fulfillment of the right−handed roadmap

uncompleted left−handed roadmap

(with competence)

uncompleted left−handed roadmap

(without competence)

↔competing for bases →

subs

titu

tion

←substitution

Fig S2d1 Simulation of the origin of homochirality of life. The roadmap in Fig 1A is a right-handed roadmap,whose chirally symmetric counterpart, i. e. the left-handed roadmap, can also be achieved in theory (Fig2D). In this simulation, a constant substitution rate is assumed for the substitutions from #1 to #32 for theright-handed roadmap and from 1# to 32# for the left-handed roadmap. Which chirality to be selected isabsolutely by chance. The programme can be run for many times. The two chiral roadmaps are obtainedequally. When any chiral roadmap is fulfilled (such as the right-handed roadmap in the present figure),the chirally opposite roadmap may reach a certain uncompleted stage. The simulation indicates that thecompetition for the non-chiral bases between the two chiral roadmaps may enlarge the gap between thefulfilled chiral roadmap and the uncompleted chiral roadmap.

62


https://doi.org/10.1101/017988

AA

A-H

4

T

TT

-H4

AA

T-H

4 A

TT

-H4

A

TA

-H4

TA

T-H

4

TA

A-H

4

TTA-H

4

A

AC

-H3

GTT-H

3

CA

A-H

3

TTG

-H3

A

CA

-H3

TGT-H3

TCA

-H3

TGA-H3

ACT-H3

AGT-H3

GTA-H3 TAC-H

3

CTA-H3TA

G-H

3

AGA-H3

TC

T-H

3CTC

-H2

GA

G-H

2

AA

G-H

3

C

TT

-H3G

AA

-H3T

TC

-H3

AT

C-H

3

GA

T-H

3

AT

G-H

3

CA

T-H

3A

CC

-H2

GG

T-H

2

CC

A-H

2

TG

G-H

2

CA

C-H

2GT

G-H

2

A

GC

-H2

GC

T-H

2

GC

A-H

2TGC-H2

CAG-H2

CTG-H2

ACG-H2

CGT-H2

CGA-H2

TCG-H2

GAC-H2

GT

C-H

2

CCG-H1

CG

G-H

1

GCC-H

1

GG

C-H

1

CG

C-H

1G

CG

-H1

CC

C-H

1

GG

G-H

1

AG

G-H

2 CC

T-H

2 GG

A-H

2TCC-H

2

Fig 3A The tree of codons based on the genomic codon distributions, which agrees with the differentiationof the four hierarchies Hierarchy 1 − 4 (H1-4) in the roadmap. This tree of codons is plotted based on thedistance matrix for codons by averaging the correlation coefficients of genomic codon distributions amongthe species. Both the tree of codons in this figure and the tree of species in Fig S3a5 are obtained based on asame set of genomic codon distributions.

63


https://doi.org/10.1101/017988

G G G G G G G G G G G G G G G G G G G G G G G G G G G

# 1 # 1 # 1 # 1 # 1 # 1 # 1 # 1 # 1

A G G G G C G G G G G A G C G G G G G G G C G G G A G

# 11 # 2 # 1 # 3 # 7 # 1 # 1 # 10 # 4

A G A G A C G G G G G A G T G G G G G G C T G G G A A

# 14 # 5 # 1 # 3 # 19 # 1 # 2 # 12 # 23

A G A A A C G A G G G A G T A G G G A G C T A G A A A

# 14 # 26 # 4 # 3 # 24 # 1 # 8 # 25 # 32

A G A A A C A A G G G A A T A A G G A G C T A A A A A

# 14 # 18 # 23 # 31 # 8 # 31 # 32

# 16 # 4 # 25

FROM the initial subset (GC content for the initial subset: GC%=13/(3*6)=72%)

TO the leaf nodes (GC content for the leaf nodes: GC%=16/(3*16)=33%)

Fig S3a1 The base substitutions along the roadmap may occur randomly at any places in a strand accordingto the roadmap Fig 1A. For a simple example, starting from Poly G in the present figure, the strand mayevolve randomly step by step according to the base substitutions in the triplex DNA along the roadmap. Afterseveral steps, a certain sequence of strand is achieved. Such base substitutions essentially determine thegenomic codon distributions for species (Fig S3a3). The GC content for the initial subset (Fig 1A) is about72%, and the GC content for the leaf nodes about 33%. This range of GC content generally agrees with therange of GC content in observations (Fig S3a2).

64


https://doi.org/10.1101/017988

0 10 20 30 40 50 60 70 80 90 10010

0

101

102

103

GC content (percentage)

GC

co

nte

nt

dis

trib

utio

ns f

or

do

ma

ins

Bacteria

Euryarchaeota

Crenarchaeota

Eukarya

Fig S3a2 GC content ranges for taxa, which can be explained by the evolution of sequences based on theroadmap (Fig S3a1). The origin order of domains (obtained in Fig 3C) is irrelevant to the order of the averageGC contents in taxa here.

65


https://doi.org/10.1101/017988

An example for calculating genomic codon distributions: >gi|71658851|ref|NC_007220.1| Duck circovirus, complete genome1 2 3 4 5 6 7 8 9 1011 1213 141516 171819 202122 2324 252627 282930 3132 333435 363738 394041 4243 444546 474849 5051 525354 555657 585960 6162 636465 666768 6970 717273 747576 777879 8081 828384 858687 8889 909192 939495 969798 99100

1

101

201

301

401

501

601

701

801

901

1001

1101

1201

1301

1401

1501

1601

1701

1801

1901

A CCGGCGCT T GT A CT CCGT A CT CCGGGCCA A GGGGA T A GCA A CT GCA A T GGCGA A GA GCGGCA A CT A CT CA T A CA A GA GGT GGGT CT T T A CCA T T A A T A ACCCT A CCT T T GA A GA CT A T GT T CA T GT T CT A GA GT T T T GCA CGCT CGA CA A T T GCA A GT T T GCT A T CGT CGGCGA GGA GA A GGGCGCGA A CGGCA CGCCTCA CCT T CA GGGA T T CCT GA A CCT CCGA A GT A A T GCGCGA GCT GCCGCCCT T GA GGA GT CGCT GGGA GGA A GA GCCT GGCT CT CT CGT GCCCGGGGA T CT GA CGA A GA CA A T GA A GA A T A T T GCGCCA A A GA GT CGA CA T A CCT T CGA GT T GGT GA A CCA GT CT CCA A GGGT CGA T CCT CT GA T CT GGCCGA A GCGA CA T CCGCT GT GA T GGCT GGT GT GCCGCT A A CT GA GGT GGCCCGGA A GT T CCCCA CGA CT T A T GT T A T CT T T GGGCGT GGCCT GGA A CGCCT CCGT CA CCT GA T CGT CGA GA CGCA A CGT GA T T GGA A GA CCGA A GT CA T CGT T CT GA T T GGT CCGCCCT GCA CCGGA A A GA GCCGT T A T GCA T T T GA A T T T CCCGCCGA A A A CAA GT A T T A CA A A CCA CGCGGGA A GT GGT GGGA CGGT T A CT CGGGA A A T GA CGT A GT CGT CA T GGA CGA CT T T T A T GGT T GGT T GCCGT A T GA T GA T T T GCTGA GA A T T A CCGA CCGT T A CCCGT T GCGGGT T GA A T T T A A A GGCGGT A T GA CT CA GT T T GT GGCT A A A A CA T T GA T CA T A A CA A GT A A CCGCGA GCCT CGTGA T T GGT A CA A GA GCGA GT T T GA CCA T T CA GCT CT GT A CCGCA GA A T T A A CA A GT A T CT T GT GT A T A A T A T CGA CA A GT A T GA A CCCGCCCA GGCA T GT ACCCT CCCGT T CCCA A T A A A CT A CT GA GA CGA A A T T GCCGT T A GT CGT T T A T T A A GGT A A A CA CT T GGCA GGGT A T T A CA CT T GGGCA GCT CGGT T A A GGTT A GGCA CT T GGCA GGGT A T T A CA CT T GGGCA GCT CA GT T A A GGT T A GGCA CT T GGCA GGGT A T T A CA CT T GGGCA GCT CGGT T A A GGT T A GGCA CT T GGCA GGGT A T T A CA CT T GGGCA GCT CGGT T A A GGT T A GA A CA A GA T CT CGT T T A T GCCT A GA A T CCA GT GA A CT GT CCA A A T T T GA T A T A A A A T GT GA A CT GGGCCT CT A T GT CGT A T T GT GCA T T CA CA CCT GT T GT GT T GT CCGGCT T CCGT A GA CT CA T GCCCA T GCCGT A A T GCA CT A CT T GA T CT T GA GT GCA GT T GAT CCA GGT GA A A T T T T T GCCT CCGA GGA A GA A GGT A GA GT GCT T CGT A CCT T CA CCCGCT CCT T GT A T GA T GGGT T T A GGT A T GA A GT A CCGT CT GT GCA CT CT GGA GGGGT CCCA CGT T CT CCT GCT GCT GGT T GA GCCGT A CGGGT CT A T GGA CCA T CCCGT CGT GGA T GT CT T CA CT A T T T GCCCGT CCT T GT CT A T CA CCGT T GA GCCT T GA GT CT T GCT CT T CT GGT A A A T GT T GT A T GCCGGT CT GA GA GT T A T GGCT A CCCCCT T GA T CA T GT A GT A A T CGT A A GGGT A GT GGTA GA A CGGT GT T GA GCCGGT CA T GT A GCT GT T CGT GT CCCCGA A CA T CGCCCA T CT CA T GT T T A A GCCGCA T A T A T T GT T GCCA CGGGCCGGT GGGT CT GTGT A T T GT GCGCCA T CT T CT A A GCT T A A GCT T T GCCA T T T GCCT GCT GCA GT CGGT CCT GT GGT GGA GCCGA A A A A GCCA A A T A CGGT GT T CCT CGT GA CCT T A T A A GT CA CT A CGGA A A A CCGT CGT CGGGGT CT T GCGA T T CGT A GCCT T CGT CT T CT GA A T CT CCT A CGT A GA CCCCGCCT CT T CCT CCGA CCA CGA TA T GCA CGCCGA T A GGT GCGGCCGCGCA T CGGCA A CGT GGGCA GT GA CT T CCT GT A A GA T A T A A A A GA CCGCA CA A GCGCCGT T A T A T T A T T

The codon positions in the above genome for GCA (green) and TAT (red) are as follows respectively:

39 45 61 139 154 193 509 556 576 841 894 967 985 1004 1011 1029 1048 1055 1073 1092 1099 1117 1219 1274 1293 1397 1668 1747 1903 1925 1931 1940 1970

117 164 318 456 461 573 603 672 687 746 855 864 869 879 949 973 1017 1061 1105 1150 1184 1206 1213 1365 1380 1449 1479 1497 1540 1557 1671 1673 1702 1802 1900 1959 1983 1985 1988

Hence, the genomic codon intervals for GCA (green) and TAT (red) are as follows respectively:

6 16 78 15 39 316 47 20 265 53 73 18 19 7 18 19 7 18 19 7 18 102 55 19 104 271 79 156 22 6 9 30

47 154 138 5 112 30 69 15 59 109 9 5 10 70 24 44 44 44 45 34 22 7 152 15 69 30 18 43 17 114 2 29 100 98 59 24 2 3

Thus, the genomic codon distributions for GCA (green) and TAT (red) are as follows respectively (cutoff=90):

0 0 0 0 0 2 3 0 1 0 0 0 0 0 1 1 0 4 4 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0[ ]

0 2 1 0 2 0 1 0 1 1 0 0 0 0 2 0 1 1 0 0 0 1 0 2 0 0 0 0 1 2 0 0 0 1 0 0 0 0 0 0 0 0 1 3 1 0 1 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0[ ]

Similarly, the genomic codon distributions for all the 64 codons can be obtained for any complete genome.

The composition vector (K=3) corresponding to the above genome:

34 34 21 46 21 43 33 27 31 25 39 24 36 35 40 44 29 28 29 40 49 27 22 35 20 28 23 25 35 34 18 39 30 24 27 31 23 32 21 30 42 25 25 23 34 24 20 33 42 38 42 38 33 31 20 34 19 27 28 39 55 25 35 25[ ]

GGG

GGC

GGA

GGT

GCG

GCC

GCA

GCT

GAG

GAC

GAA

GAT

GTG

GTC

GTA

GTT

CGG

CGC

CGA

CGT

CCG

CCC

CCA

CCT

CAG

CAC

CAA

CAT

CTG

CTC

CTA

CTT

AGG

AGC

AGA

AGT

ACG

ACC

ACA

ACT

AAG

AAC

AAA

AAT

ATG

ATC

ATA

ATT

TGG

TGC

TGA

TGT

TCG

TCC

TCA

TCT

TAG

TAC

TAA

TAT

TTG

TTC

TTA

TTT

0 10 20 30 40 50 60 70 80 900

1

2

3

4

5

Number of GCA: 33

Height:0.28889

The genomic codon distribution for GCA

0 10 20 30 40 50 60 70 80 900

1

2

3

4

5

Number of TAT: 39

Height0.33333

The genomic codon distribution for TAT

Fig S3a3 Definition of the genomic codon distribution. Take the complete genome of Duck circovirusfor example, and take the codons GCA (denoted in green) and T AT (denoted in red) among the 64 codonsfor example. First, mark all the positions of GCA and T AT in the genome respectively; second, calculatethe codon intervals for GCA and T AT respectively, and obtain the distributions of intervals of GCA andT AT respectively, where a cutoff 90 is used in this simple case when counting the number of commonintervals (a sufficiently large enough cutoff 1000 is used in this paper); and third, obtain the genomic codondistributions for GCA and T AT by normalising the distributions of intervals of GCA and T AT respectively.The genomic codon distribution for a species consists of 64 distributions for each of the 64 codons. Anexample of unnormalised genomic codon distribution (take for example only two distributions for GCA andT AT ) is illustrated in two subfigures of the present figure. Four normalised genomic codon distributionsfor species in different domains are respectively shown in Fig S3b2. In addition, the summations of theelements of the unnormalised genomic codon distributions (e. g. 33 and 39 for GCA and T AT respectively)are obtained; consequently the average height of the genomic codon distribution is obtained for each of thedistributions of the 64 codons respectively. In a so-called composition vector tree method in the molecularphylogenetics, the elements of the K = 3 composition vector for a species (e. g. the elements 33 and39 for GCA and T AT ) correspond to the summations of the elements of the genomic cordon distributionsrespectively, which are also proportional to the average heights of the genomic codon distribution.

66


https://doi.org/10.1101/017988

1 2 3 4 5 6 7 8 9 101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100

1

101

201

301

401

501

601

701

801

901

1001

1101

1201

1301

1401

1501

1601

1701

1801

1901

2001

2101

2201

2301

2401

2501

2601

2701

2801

2901

3001

3101

3201

3301

3401

3501

3601

3701

3801

3901

ACCGGCGCT T GT A CT CCGT ACT CCGGGCCAA GGGGA T A GCA A CT GCA A T GGCGA A GA GCGGCA A CT A CT CA T A CA A GA GGT GGGT CT T T ACCA T T A A T AACCCT A CCT T T GA A GA CT A T GT T CA T GT T CT A GA GT T T T GCA CGCT CGA CAA T T GCA A GT T T GCT A T CGT CGGCGA GGA GAA GGGCGCGA ACGGCA CGCCTCA CCT T CA GGGA T T CCT GA ACCT CCGA A GT A A T GCGCGA GCT GCCGCCCT T GA GGA GT CGCT GGGA GGA AGA GCCT GGCT CT CT CGT GCCCGGGGA T CT GACGA A GA CA AT GA A GA A T A T T GCGCCA A A GA GT CGA CA T ACCT T CGA GT T GGT GA A CCA GT CT CCA A GGGT CGA T CCT CT GA T CT GGCCGA A GCGA CA T CCGCT GT GA T GGCT GGT GT GCCGCT A A CT GAGGT GGCCCGGA A GT T CCCCACGA CT T A T GT T A T CT T T GGGCGT GGCCT GGA A CGCCT CCGT CA CCT GA T CGT CGA GA CGCA A CGT GA T T GGA A GA CCGA AGT CA T CGT T CT GA T T GGT CCGCCCT GCA CCGGA A A GA GCCGT T A T GCA T T T GA A T T T CCCGCCGA A A A CAAGT A T T A CA AA CCA CGCGGGA A GT GGT GGGA CGGT T A CT CGGGA A A T GA CGT A GT CGT CAT GGA CGA CT T T T A T GGT T GGT T GCCGT A T GA T GA T T T GCTGA GA A T T A CCGA CCGT T A CCCGT T GCGGGT T GA A T T T A A AGGCGGT A T GACT CA GT T T GT GGCT A A A A CAT T GA T CA T A ACA A GT A A CCGCGA GCCT CGTGA T T GGT A CAA GA GCGA GT T T GA CCA T T CAGCT CT GT A CCGCA GA A T T A ACA A GT A T CT T GT GT A T A A T AT CGA CA A GT AT GA A CCCGCCCA GGCA T GT ACCCT CCCGT T CCCA A T A A A CT A CT GA GA CGA A A T T GCCGT T A GT CGT T T AT T A A GGT A A ACA CT T GGCA GGGT A T T A CA CT T GGGCA GCT CGGT T A A GGTT A GGCA CT T GGCA GGGT A T T A CA CT T GGGCA GCT CA GT T AA GGT T A GGCACT T GGCA GGGT A T T A CA CT T GGGCA GCT CGGT T A A GGT T AGGCA CT T GGCAGGGT A T T A CA CT T GGGCA GCT CGGT T A A GGT T A GA A CA AGA T CT CGT T T A T GCCT A GA AT CCA GT GA A CT GT CCA A A T T T GA T A T A A A AT GT GA A CT GGGCCT CT A T GT CGT A T T GT GCA T T CA CA CCT GT T GT GT T GT CCGGCT T CCGT A GA CT CA T GCCCA T GCCGT A A T GCA CT A CT T GA T CT T GAGT GCA GT T GAT CCA GGT GA AA T T T T T GCCT CCGA GGA A GAA GGT A GA GT GCT T CGT A CCT T CA CCCGCT CCT T GT A T GA T GGGT T T A GGT A T GA A GT A CCGT CT GT GCACT CT GGA GGGGT CCCA CGT T CT CCT GCT GCT GGT T GA GCCGT A CGGGT CT AT GGA CCA T CCCGT CGT GGA T GT CT T CA CT AT T T GCCCGT CCT T GT CT A T CACCGT T GA GCCT T GA GT CT T GCT CT T CT GGT A A A T GT T GT A T GCCGGT CT GA GA GT T A T GGCT A CCCCCT T GA T CA T GT AGT A A T CGT A AGGGT A GT GGTAGA A CGGT GT T GA GCCGGT CA T GT A GCT GT T CGT GT CCCCGA A CA T CGCCCA T CT CA T GT T T A A GCCGCAT A T A T T GT T GCCA CGGGCCGGT GGGT CT GTGT A T T GT GCGCCA T CT T CT AA GCT T A A GCT T T GCCA T T T GCCT GCT GCA GT CGGT CCT GT GGT GGA GCCGA A A A A GCCA AA T A CGGT GT T CCT CGT GA CCT T A T A A GT CACT A CGGA A A ACCGT CGT CGGGGT CT T GCGAT T CGT A GCCT T CGT CT T CT GA A T CT CCT A CGT A GA CCCCGCCT CT T CCT CCGA CCA CGATAT GCA CGCCGA T A GGT GCGGCCGCGCA T CGGCA A CGT GGGCA GT GA CT T CCT GT A A GA T AT A A A A GA CCGCA CA A GCGCCGT T A T A T T A T T A CCGGCGCTT GT A CT CCGT A CT CCGGGCCA A GGGGA T A GCA A CT GCA A T GGCGA A GA GCGGCA A CT A CT CA T A CA A GA GGT GGGT CT T T A CCA T T A A T AA CCCT A CCT TT GA A GA CT A T GT T CA T GT T CT A GA GT T T T GCA CGCT CGA CA A T T GCA A GT T T GCT A T CGT CGGCGA GGA GA A GGGCGCGAA CGGCA CGCCT CA CCT T CAGGGA T T CCT GAA CCT CCGA A GT A A T GCGCGAGCT GCCGCCCT T GA GGA GT CGCT GGGA GGAA GA GCCT GGCT CT CT CGT GCCCGGGGA T CT GA CGA A GA CAAT GA A GA A T AT T GCGCCA A AGA GT CGA CA T A CCT T CGA GT T GGT GA A CCAGT CT CCA A GGGT CGA T CCT CT GA T CT GGCCGA A GCGA CA T CCGCT GT GATGGCT GGT GT GCCGCT A A CT GA GGT GGCCCGGA A GT T CCCCA CGA CT T A T GT T A T CT T T GGGCGT GGCCT GGA A CGCCT CCGT CA CCT GA T CGT CGA GA CGCA A CGT GA T T GGA A GA CCGAA GT CA T CGT T CT GA T T GGT CCGCCCT GCA CCGGA A A GA GCCGT T A T GCA T T T GA A T T T CCCGCCGA A A A CA A GT A T T A CAAA CCA CGCGGGA A GT GGT GGGA CGGT T A CT CGGGA A A T GACGT A GT CGT CA T GGA CGA CT T T T A T GGT T GGT T GCCGT A T GA T GA T T T GCT GA GA A T T ACCGA CCGT T A CCCGT T GCGGGT T GA A T T T A AA GGCGGT A T GA CT CA GT T T GT GGCT A A A A CA T T GA T CA T AA CA A GT A A CCGCGA GCCT CGT GA T T GGT ACAA GA GCGA GT T T GA CCA T T CA GCT CT GT A CCGCA GA A T T AA CA A GT A T CT T GT GT A T A A T A T CGA CA A GT A T GA A CCCGCCCA GGCA T GT A CCCT CCCGTT CCCA A T A A ACT A CT GA GA CGA A A T T GCCGT T A GT CGT T T A T T A A GGT A AA CA CT T GGCAGGGT A T T A CACT T GGGCA GCT CGGT T A A GGT T A GGCA CT TGGCA GGGT A T T A CA CT T GGGCA GCT CA GT T A A GGT T A GGCA CT T GGCA GGGT A T T A CA CT T GGGCA GCT CGGT T A A GGT T A GGCA CT T GGCA GGGT A T T ACA CT T GGGCAGCT CGGT T A AGGT T A GA A CAA GA T CT CGT T T A T GCCT A GAA T CCA GT GA ACT GT CCA A A T T T GA T A T A A AA T GT GA A CT GGGCCT CT A T GT CGT A T T GT GCA T T CA CA CCT GT T GT GT T GT CCGGCT T CCGT A GA CT CA T GCCCA T GCCGT A A T GCA CT ACT T GA T CT T GA GT GCA GT T GA T CCA GGT GAAA T T T T T GCCT CCGA GGA A GA A GGT A GA GT GCT T CGT A CCT T CA CCCGCT CCT T GT A T GAT GGGT T T A GGT A T GA A GT A CCGT CT GT GCACT CT GGA GGGGT CCCA CGT T CT CCT GCT GCT GGT T GA GCCGT A CGGGT CT A T GGA CCA T CCCGT CGT GGAT GT CT T CA CT A T T T GCCCGT CCT T GT CT A T CA CCGT T GAGCCT T GA GT CT T GCT CT T CT GGT A A A T GT T GT A T GCCGGT CT GA GA GT T A T GGCT A CCCCCT T GA T CA T GT A GT A A T CGT AA GGGT A GT GGT A GA A CGGT GT T GA GCCGGT CA T GT A GCT GT T CGT GT CCCCGA A CA T CGCCCA T CT CA T GT T T A A GCCGCA T A T A T T GT T GCCA CGGGCCGGT GGGT CT GT GT A T T GT GCGCCA T CT T CT A A GCT T A A GCT T T GCCA T T T GCCT GCT GCAGT CGGT CCT GT GGT GGA GCCGA A A A A GCCAA A T A CGGT GT T CCT CGT GA CCT T A T A A GT CACT A CGGA A AA CCGT CGT CGGGGT CT T GCGA T T CGT A GCCT T CGT CT T CT GA A T CT CCT ACGT A GA CCCCGCCT CT T CCT CCGA CCA CGAT A T GCA CGCCGA T A GGT GCGGCCGCGCA T CGGCA A CGT GGGCA GT GA CT T CCT GT A A GA T A T A A A A GA CCGCA CA A GCGCCGT T A T A T T AT T

Genomic codon distribution for GCA (First half: original genome)0 0 0 0 0 2 3 0 1 0 0 0 0 0 1 1 0 4 4 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0

Genomic codon distribution for GCA (Second half: its duplication)0 0 0 0 0 2 3 0 1 0 0 0 0 0 1 1 0 4 4 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0

Genomic codon distribution for GCA (the duplicated genome)0 0 0 0 0 4 6 0 2 0 0 0 0 0 2 2 0 8 8 2 0 2 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 2 0 0 0 0 0 2 0 2 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0

0 20 40 60 800

1

2

3

4

Genomic codon distribution for a genome and its duplicatioin

0 20 40 60 800

2

4

6

8

Almost invariance of the genomic codon distribution in the duplication of a genome

Fig S3a4 Invariance of the genomic codon distribution after genome duplication. If a genome expands byduplicating itself, the genomic codon distribution does not change due to the normalisation. In the presentfigure, the distribution in the left subfigure and the distribution in the right subfigure will be same afternormalisation. In this sense, the genomic codon distribution is a conservative and intrinsic feature of aspecies when its genome evolves generally by large scale duplications (Fig S4a2). The trees of species in FigS3a5 and Fig S3d8 are obtained based on the genomic codon distributions of species with large variation ofgenome sizes, where the genome sizes themselves do not play a significant role also due to the normalisationin the definition of the genomic codon distribution.

67


https://doi.org/10.1101/017988

001BCya518BCya522BCya715BCya717BCya

387B720BCya

742AE004BPc

005B306BPd498BPd335BPb343BPb418BPb478BPb026BPc619BPc391BPc166BPc628BPc078BPd

213BAct

215BAct

216BAct

217BAct

189BPc

603BPc

604BPc

611BPc

610BPc

617BPc

614BPc

606BPc

607BPc

608BPc

609BPc

616BPc

605BPc

612BPc

613BPc

615BPc

259BPc

262BPc

265BPc

266BPc

267BPc

279BPc

268BPc

270BPc

277BPc

271BPc

272BPc

273BPc

280BPc

274BPc

276BPc

275BPc

278BPc

281BPc

269BPc

638BPc

639BPc

641BPc

643BPc

642BPc

644BPc

263BPc

618BPc

649BPc

320BPc

340BPc

506B

791BPc

792BPc

793BPc

795BPc

794BPc

796BPc

797BPc

799BPc

801BPc

798BPc

800BPc

802BPc

640BPc

475BPc

476BPc

260BPc

345BPc

346BPc

185BCho

436BAct

787BPc

788BPc

789BPc

790BPc

336BCho

031BPc

531BPc

392BPc

505BPc

626BPc

620BPc

629BPc

630BPc

621BPc

622BPc

623BPc

624BPc

631BPc

634BPc

636BPc

625BPc

627BPc

632BPc

633BPc

635BPc

637BPc

764BPc

765BPc

771BPc

772BPc

768BPc

769BPc

770BPc

511BPb

512BPb

750BPc

349BF

358BF

350BF

351BF

549BPc

550BPc

551BPc

186BC

hl

220BPc

224BPc

222BPc

221BPc

223BPc

532BPc

598BPc

180BC

hl803B

pa156A

E303B

Pd

181BC

hl

182BC

hl

528BC

hl

183BC

hl

501BC

hl

529BC

hl407A

E410A

E324A

E017B

Pc

018BPc

019BPc

020BPc

389BPc

048BF

062BF

063BF

065BF

060BF

061BF

064BF

069BC

hl070B

Chl

071BC

hl072B

Chl

492B

Chl

713B

Pe

513B

Chl

514B

Chl

210B

F411A

E413A

E412A

E399A

E738A

E737A

E415A

E471B

Pe

163B

F240B

F241B

F474B

Pc

726B

242B

Pd

243B

F416B

724B

cya

231B

Chl

285B

Chl

752B

Spi

486B

F233B

Cho

234B

Cho

235B

Cho

038B

F282B

F300B

F301B

F036B

Pa

756B

Act

757B

Act

753B

Spi

754B

Spi

352B

F353B

F758A

E354B

F464B

Pb

465B

Pb

466B

Pb

469B

Pb

467B

Pb

468B

Pb

249B

Pc

258B

010B

Pc

012B

Pc

013B

Pc

011B

Pc

014B

Pc

016B

Pc

211B

Pc

530B

552B

Pc

379B

F380B

fir027B

Pc

766B

Pc

767B

Pc

313B

Pc

314B

Pc

315B

Pc

317B

Pc

316B

Pc

496B

Pc

359B

F360B

F497B

F361B

F318B

Pc

319B

Pc

363B

F364B

F365B

F675B

F677B

F676B

F679B

F700B

F701B

F702B

F681B

F684B

F686B

F682B

F685B

F683B

F687B

F698B

F689B

F695B

F692B

F697B

F694B

F691B

F690B

F696B

F693B

F688B

F699B

F703B

F705B

F704B

F680B

F678B

F073B

Pa

074B

Pa

076B

Pa

075B

Pa

367B

Pc

368B

Pc

369B

Pc

370B

Pc

743B

The

032B

Cya

482B

Cya

481B

Cya

426B

Cya

228B

Cya

229B

Cya

230B

Cya

755B

Cya

049B

F050B

F051B

F053B

F06

6BF

067B

F05

9BF

052B

F05

5BF

054B

F05

6BF

058B

F06

8BF

057B

F48

4B38

6BF

261B

F38

1BF

382B

F38

3BF

384B

F38

5BF

654B

F65

5BF

661B

F65

6BF

667B

F66

4BF

663B

F66

6BF

657B

F65

8BF

659B

F66

0BF

662B

F66

5BF

671B

F67

0BF

668B

F66

9BF

348B

F35

5BF

357B

F35

6BF

493B

Ver

029B

F03

0BF

208B

F31

1BChl

462B

F72

9BF

731B

F73

0BF

203B

F28

3BThe

503B

The

137B

015B

Pc

037B

Pa

470B

Pa

212B

F

167B

Ver

168B

Ver

169B

Ver

170B

Ver

171B

Ver

172BVer

173BVer

174BVer

175BVer

176BVer

178BVer

177BVer

516BCya

519BCya

525BCya

526BCya

372BSpi

373BSpi

374BSpi

375BSpi

376BSpi

377BSpi

759B

805E2

818E15

807E4

819E16

776BPa

284BF

286BChl

402AE

403AE

404AE

405AE

406AE

457BF

583BPa

587BPa

591BPa

592BPa

589BPa

588BPa

584BPa

585BPa

586BPa

672AC

709AC

710AC

711AC

751BPc

192BF

193BF

202BF

362BF

740BThe

741BThe

139BPe

140BPe

141BPe

142BPe

149BChl

148BChl

774BPa

326BPe

328BPe

329BPe

332BPe

330BPe

331BPe

333BPe

777B

327BPe

039B040AE

733AE

734AE

744BThe

745BThe

746BThe

397AE

155A248AC

396AC

557AE

559AE

558AE

138AC

507AE002BF287BPc

288BPc

294BPc

292BPc295BPc289BPc290BPc291BPc293BPc160BPc162BPc590BPa593BPa077B143BPe144BPe147BPe145BPe146BPe447BF337BAqu712BAqu400AE515BCya521BCya520BCya523BCya517BCya524BCya735B775BPa366BPd804E1806E3808E5810E7811E8814E11815E12817E14809E6812E9816E13813E10023AC

338AC553AC556AC554AC555AC739AC341AC736AC151BPc451BF452BF453BF194BF196BF197BF198BF201BF195BF209BF199BF200BF205BF206BF207BF204BF250B251B398AE477AC401AE449BF044BF488B158BF157BPa446BF459BF490BPa491BPa090BSpi094BSpi

091BSpi092BSpi

095BSpi097BSpi

003BPa308BPa

309BPa047BPa

778BPa388BPa

582BPa046BPb

749BPb098BPa

099BPa100BPa

575BPa576BPa

577BPa578BPa

580BPa579BPa

483BPa489B425BPc

252BPa645BPa

008BPb238BPb

187BPc086BPb

087BPb009BPb

089BPb510BPb

533BPc534BPc

536BPc535BPc

537BPc538BPc

540BPc545BPc

225BPb

560BPb

564BPb

045BPb

763BPb

673BPc

674BPc

247BPd

164BPa

165BPa

419BPa

423BPa

420BPa

421BPa

422BPa

028BPc

325BPc

494BPa

237BD

ei

297BA

ct

298BA

ct

601BA

ct

602BA

ct

429BA

ct

430BA

ct

433BA

ct

439BA

ct

445BA

ct

434BA

ct

435BA

ct

438BA

ct

573BA

ct

732BA

ct

460BPd

581BPa

714B

115BPb

119BPb

117BPb

118BPb

120BPb

116BPb

121BPb

135BPb

127BPb

126BPb

130BPb

131BPb

132BPb

133BPb

134BPb

122BPb

125BPb

124BPb

123BPb

378B

417BPb

653BPa

570BPa

572BPa

571BPa

651B

652BPa

296BA

ct

599BA

ct

706Bact

707Bact

708Bact

479Bact

006B

Aci

650B

Aci

307B

cya

085B

Pb

232B

Pb

546B

Pc

547B

Pc

548B

Pc

539B

Pc

541B

Pc

543B

Pc

542B

Pc

544B

Pc

509B

Pb

574B

Pb

128B

Pb

129B

Pb

136B

Pb

561B

Pb

562B

Pb

563B

Pb

188B

Pc

779B

Pc

781B

Pc

780B

Pc

782B

Pc

783B

Pc

784B

Pc

785B

Pc

786B

Pc

080B

Act

082B

Act

083B

Act

084B

Act

081B

Act

245B

Pd

246B

Pd

024B

Pa

395B

Pa

566B

Pa

567B

Pa

568B

Pa

569B

Pa

647B

Pa

648B

Pa

495B

Pa

472B

Pa

473B

Pa

487B

Pa

339B

Pa

394B

Pa

424B

Pa

264B

Pa

342B

Pa

596B

Pa

646B

Pa

390B

Pa

079B

101B

pa

102B

pa

105B

pa

103B

pa

107B

Pa

108B

Pa

106B

pa

104B

pa

485B

Pa

244B

Pd

312B

310B

Pa

594B

Cho

595B

Cho

021B

PC

022B

PC

153B

Pc

302B

Pd

304B

Pd

305B

Pd

499B

Pd

179B

Chl

184B

Chl

500B

Chl

725B

Pd

154B

F025B

239B

Pd

427B

F502B

F334B

F

727B

Pd

042B

Act

043B

Act

214B

Act

218B

Act

219B

Act

428B

Act

431B

Act

432B

Act

440B

Act

441B

Act

442B

Act

443B

Act

437B

Act

527B

Act

226B

cya

227B

cya

716B

Cya

722B

Cya

723B

Cya

721B

Cya

565B

Act

718B

Cya

719B

Cya

236B

Dei

321A

E

463A

E

408A

E

600B

Chl

409A

E

371B

act

007B

Act

444B

Act

322A

E

323A

E

088B

Pb

508B

033B

Pd

035B

Pd

034B

Pd

190B

Act

191B

Act

347B

Act

504B

Pa

597B

Act

480B

Act

344B

Act

929V

dD

909V

dD

461A

041B

Pe

299B

414A

E

393B

F

760B

F

761B

F

762B

F

109B

Pc

110B

Pc

113B

Pc

114B

Pc

150B

Pc

112B

Pc

253B

Pa

254B

Pa

255B

Pa

256B

Pa

257B

Pa

908V

dD

873V

dD

931V

dD

093B

Spi

096B

Spi

450B

F

925V

dD

927V

UP

866V

dD

895V

dD

896V

dD

950V

dD

951V

dD

728B

Pb

901V

dD

773B

Pc

747B

Dei

748B

Dei

159B

448B

F

454B

F

458B

F

456B

F

827V

dD

919V

dD

875V

dD

823V

dD

882V

dD

884V

dD

824V

dD

821V

dD

825V

dD

915V

dD

917V

dD

916V

dD11

1BPc

887V

dD

889VdD

891VdD

890VdD

888VdD

822VdD

826VdD

834VU

P

902VdD881VdD

921VdD920VdD

918VdD455BF

161BChl

871VdD867VdD

870VdD868VdD

869VdD906VdD

913VdD900VdD

922VdD938VdD

936VsR928VdD

940VdD946VdD

941VdD874VdD

907VdD883VdD937VsR935VsR

914VdD152BPc

910VdD872VUP

894VRT945VsR944VsR

877VUV924VdD933VsR932VsR926VdD942VsD943VsD

847VdRm947VsR892VsR854VdRm

849VdRm857VdRm

836VdRm856VdRm

845VdRm864VdRm838VdRm832VsDm905VsDo876VsD885VsRb930VsD899VsD830VsDc893VsD903VsD923VsD911VsDcl833VsD835VUC904VsDo878VsRe828VdDB912VsDcl829VdDB840VdRm846VdRm848VdRm863VdRm839VdRm862VdRm837VdRm853VdRm860VdRm831VsD952VSat934VSat949VdRr843VdRm850VdRm859VdRm844VdRm861VdRm948VdRr886VsRb855VdRm

851VdRm841VdRm858VdRm879VsRe

852VdRm842VdRm865VdRm880VsRe939VUC897VUC898VUC820VUC

Fig S3a5 Tree of species according to the average correlation coefficients among genomic codon distributionsby averaging for the 64 codons. This tree is reasonable to some extent. Considering the also reasonable treeof codons in Fig 3A that is obtained based on the same data, it is inferred that an intrinsic relationship existsbetween the tree of life and the evolution of the genetic code. Please enlarge to see details. Refer to the legendof Fig 3D in the appendix to see the abbreviations of the taxa.

68


https://doi.org/10.1101/017988

BeeT

halecressN

emat

ode

Beetle

Cerevisiae

Cow

Dog

Zebrafish

Horse O

poss

um

Mou

se

Rat

Hum

an

Chimpanzee

M. Mulatta

Chicken

Fig S3a6 Tree of eukaryotes based on the average correlation coefficients among genomic codondistributions. This tree is reasonable to some extent. Note that most eukaryotic genomes are non-codingDNA. So this reasonable tree is mostly based on the non-coding DNA, which indicates that the non-codingDNA, rather than the coding DNA, plays a principal role in the phylogeny of eukaryotes.

69


https://doi.org/10.1101/017988

0 9 18 27 36 45 54 63 72 81 90 990

0.01

0.02

0.03

0.04

0.05

Virus

0 9 18 27 36 45 54 63 72 81 90 990

0.01

0.02

0.03

0.04

0.05

Bacteria

0 9 18 27 36 45 54 63 72 81 90 990

0.01

0.02

0.03

0.04

0.05

Archaea

0 9 18 27 36 45 54 63 72 81 90 990

0.01

0.02

0.03

0.04

0.05

Eukaryote

Fig 3B Features of genomic codon distributions for domains based on complete genome sequences. Referto Fig S3b1 to see the genomic codon distributions of Crenarchaeota and Euryarchaeota; refer to Fig S3b2 tosee the genomic codon distributions for species; and refer to Fig S3b3 to see the simulations of the genomiccodon distributions.

70


https://doi.org/10.1101/017988

0 9 18 27 36 45 54 63 72 81 90 990

0.01

0.02

0.03

0.04

0.05

Crenarchaeota

0 9 18 27 36 45 54 63 72 81 90 990

0.01

0.02

0.03

0.04

0.05

Euryarchaeota

Fig S3b1 Genomic codon distributions for Crenarchaeota and Euryarchaeota based on genomic data. Thefeatures for both Crenarchaeota and Euryarchaeota agree with that for Archaea in Fig 3B, while they arequite different from the features for Eukarya in Fig 3B. So the relationship between Crenarchaeota andEuryarchaeota is closer than with Eukarya, which tends to agree with the three-domain tree rather than theeocyte tree.

71


https://doi.org/10.1101/017988

0 9 18 27 36 45 54 63 72 81 90 990

0.02

0.04

0.06

0.08

0.1

Virus: Yersinia phage Berlin

0 9 18 27 36 45 54 63 72 81 90 990

0.01

0.02

0.03

0.04

0.05

Bacteria: Zymomonas mobilis ZM4

0 9 18 27 36 45 54 63 72 81 90 990

0.01

0.02

0.03

0.04

0.05

Archaea: Methanococcus vannielii SB

0 9 18 27 36 45 54 63 72 81 90 990

0.01

0.02

0.03

0.04

0.05

Eukaryote: Homo sapiens

Fig S3b2 Features of genomic codon distributions of species in different domains based on genomic data.The features for species in the present figures agree with the features for the corresponding domains in Fig3B respectively. For example, the features for eukaryotes are quite different from the features for the otherdomains, either for any species in the present figure or for the domain in Fig 3B.

72


https://doi.org/10.1101/017988

0 9 18 27 36 45 54 63 72 81 90 990

0.02

0.04

0.06

0.08

simulation for virus

0 9 18 27 36 45 54 63 72 81 90 990

0.02

0.04

0.06

0.08

simulation for bacteria

0 9 18 27 36 45 54 63 72 81 90 990

0.01

0.02

0.03

0.04

simulation for archaea

0 9 18 27 36 45 54 63 72 81 90 990

0.01

0.02

0.03

0.04

simulation for eukaryote

Fig S3b3 Simulations of features of genomic codon distributions for the different domains. This simulationresults for the domains agree with the observations based on genomic data in Fig 2B, respectively. Thesimulations also reveal the mechanism that results in such features.

73


https://doi.org/10.1101/017988

#1 #2 #3 #4 #5 #6 #7 #8 #9 #10#11#12#13#14#15#16#17#18#19#20#21#22#23#24#25#26#27#28#29#30#31#32

2

4

6

8

10

12

14

16

18

20

22

24

26

28

30

32

34

36

38

40

42

44

46

48

50

52

54

56

58

60

62

64

Recruitment order of the 32 codon pairs

Heig

ht ord

er

of th

e 6

4 g

enom

ic c

odon d

istr

ibutions for

dom

ain

s

Virus

Bacteria

Crenarchaeota

Euryarchaeota

Archaea

Eukaryote

Fig 3C On the origin order of the domains. The recruitment order of the 32 codon pairs indicates anevolutionary direction, which reveals the origin order of the domains.

74


https://doi.org/10.1101/017988

64 CGA #13 H2 GCC #02 H1 AAU #30 H4 AAU #30 H4 AAU #30 H4 AUU #30 H4

63 GGC #02 H1 GGC #02 H1 AUU #30 H4 AUU #30 H4 AUU #30 H4 AAU #30 H4

62 CCG #10 H1 CGC #07 H1 GAA #23 H3 GAA #23 H3 GAA #23 H3 UUU #32 H4

61 CGG #10 H1 GCG #07 H1 UUC #23 H3 UUC #23 H3 UUC #23 H3 AAA #32 H4

60 UCG #13 H2 CCG #10 H1 AUA #29 H4 AUA #29 H4 AUA #29 H4 UUA #31 H4

59 CGC #07 H1 CGG #10 H1 UAU #29 H4 UAU #29 H4 UAU #29 H4 UAA #31 H4

58 GCG #07 H1 GCA #09 H2 AAG #27 H3 AAG #27 H3 AAG #27 H3 CUU #27 H3

57 GCC #02 H1 UGC #09 H2 CUU #27 H3 CUU #27 H3 CUU #27 H3 AAG #27 H3

56 GAC #05 H2 UCG #13 H2 UAA #31 H4 UAA #31 H4 UAA #31 H4 UCU #14 H3

55 UGG #12 H2 CGA #13 H2 UUA #31 H4 UUA #31 H4 UUA #31 H4 AGA #14 H3

54 CAA #28 H3 AUU #30 H4 GGA #03 H2 GGA #03 H2 GGA #03 H2 UUC #23 H3

53 CUG #20 H2 AAU #30 H4 UCC #03 H2 UCC #03 H2 UCC #03 H2 GAA #23 H3

52 ACG #16 H2 AUC #21 H3 AAA #32 H4 AUC #21 H3 AAA #32 H4 UGA #15 H3

51 GAA #23 H3 GAU #21 H3 UUU #32 H4 GAU #21 H3 UUU #32 H4 UCA #15 H3

50 GCA #09 H2 CAG #20 H2 UCA #15 H3 UCA #15 H3 UCA #15 H3 CUG #20 H2

49 ACC #06 H2 CUG #20 H2 UGA #15 H3 AAA #32 H4 UGA #15 H3 CAG #20 H2

48 AAG #27 H3 AGC #08 H2 AUC #21 H3 UGA #15 H3 AUC #21 H3 UUG #28 H3

47 CAG #20 H2 UUC #23 H3 GAU #21 H3 UUU #32 H4 GAU #21 H3 CAA #28 H3

46 GAU #21 H3 GCU #08 H2 AGG #11 H2 AGA #14 H3 AGA #14 H3 UAU #29 H4

45 GCU #08 H2 GAA #23 H3 AGA #14 H3 AGG #11 H2 UCU #14 H3 AUA #29 H4

44 UGA #15 H3 CCA #12 H2 CCU #11 H2 UCU #14 H3 AGG #11 H2 AUG #22 H3

43 AUC #21 H3 UGG #12 H2 UCU #14 H3 CCU #11 H2 CCU #11 H2 CAU #22 H3

42 GGU #06 H2 UUG #28 H3 CAA #28 H3 CAA #28 H3 CAA #28 H3 UGU #18 H3

41 GUC #05 H2 CAA #28 H3 GAG #04 H2 UUG #28 H3 UUG #28 H3 ACA #18 H3

40 UCA #15 H3 UCA #15 H3 UUG #28 H3 GAG #04 H2 GAG #04 H2 UGG #12 H2

39 CCA #12 H2 UGA #15 H3 CUC #04 H2 CUC #04 H2 CUC #04 H2 CCA #12 H2

38 GGA #03 H2 CUU #27 H3 CAU #22 H3 CAU #22 H3 CAU #22 H3 CCU #11 H2

37 AAC #26 H3 AAG #27 H3 AUG #22 H3 AUG #22 H3 AUG #22 H3 AGG #11 H2

36 AGC #08 H2 CAU #22 H3 GGC #02 H1 CAG #20 H2 CAG #20 H2 AGU #17 H3

35 CGU #16 H2 AUG #22 H3 GCC #02 H1 CUG #20 H2 CUG #20 H2 ACU #17 H3

34 GAG #04 H2 UUU #32 H4 CAG #20 H2 GGC #02 H1 GGC #02 H1 UCC #03 H2

33 AUG #22 H3 AAA #32 H4 CUG #20 H2 GCC #02 H1 GCC #02 H1 GGA #03 H2

32 UUC #23 H3 ACC #06 H2 CCA #12 H2 CCA #12 H2 AAC #26 H3 GUU #26 H3

31 UGC #09 H2 GGU #06 H2 AAC #26 H3 AAC #26 H3 GUU #26 H3 AAC #26 H3

30 GUG #19 H2 AAC #26 H3 UGG #12 H2 UGG #12 H2 CCA #12 H2 CUC #04 H2

29 ACA #18 H3 GUU #26 H3 GUU #26 H3 GUU #26 H3 UGG #12 H2 GAG #04 H2

28 CAC #19 H2 UUA #31 H4 AGC #08 H2 AGC #08 H2 AGC #08 H2 UGC #09 H2

27 AGG #11 H2 UAA #31 H4 GCU #08 H2 GCU #08 H2 GCU #08 H2 GCA #09 H2

26 AGA #14 H3 CGU #16 H2 CGG #10 H1 CGG #10 H1 CGG #10 H1 GAU #21 H3

25 CAU #22 H3 ACG #16 H2 CCG #10 H1 CCG #10 H1 CCG #10 H1 AUC #21 H3

24 GUU #26 H3 UCC #03 H2 UCG #13 H2 UCG #13 H2 UCG #13 H2 GCU #08 H2

23 UUG #28 H3 GGA #03 H2 CGA #13 H2 CGA #13 H2 CGA #13 H2 AGC #08 H2

22 AAU #30 H4 GUC #05 H2 ACC #06 H2 GCA #09 H2 GCA #09 H2 UAG #25 H3

21 UCC #03 H2 GAC #05 H2 GCA #09 H2 UGC #09 H2 ACC #06 H2 CUA #25 H3

20 AUU #30 H4 CCU #11 H2 GGU #06 H2 ACC #06 H2 UGC #09 H2 GUG #19 H2

19 CUC #04 H2 AGG #11 H2 UGC #09 H2 GGU #06 H2 GGU #06 H2 CAC #19 H2

18 CCU #11 H2 UAU #29 H4 AGU #17 H3 AGU #17 H3 AGU #17 H3 GUA #24 H3

17 UGU #18 H3 AUA #29 H4 ACU #17 H3 ACU #17 H3 ACU #17 H3 UAC #24 H3

16 AAA #32 H4 CAC #19 H2 ACA #18 H3 ACA #18 H3 ACA #18 H3 GGU #06 H2

15 CUU #27 H3 GUG #19 H2 UGU #18 H3 UGU #18 H3 UGU #18 H3 ACC #06 H2

14 ACU #17 H3 UCU #14 H3 UAC #24 H3 UAC #24 H3 UAC #24 H3 GCC #02 H1

13 UAC #24 H3 AGA #14 H3 GUA #24 H3 GUA #24 H3 GUA #24 H3 GGC #02 H1

12 UCU #14 H3 CUC #04 H2 GAC #05 H2 GAC #05 H2 GAC #05 H2 GUC #05 H2

11 AGU #17 H3 GAG #04 H2 GUC #05 H2 GUC #05 H2 GUC #05 H2 GAC #05 H2

10 GGG #01 H1 ACA #18 H3 ACG #16 H2 CGU #16 H2 CGU #16 H2 CCC #01 H1

9 CCC #01 H1 UGU #18 H3 CGU #16 H2 ACG #16 H2 ACG #16 H2 GGG #01 H1

8 UUU #32 H4 CCC #01 H1 CGC #07 H1 CGC #07 H1 CGC #07 H1 CGU #16 H2

7 UAU #29 H4 GGG #01 H1 GCG #07 H1 GCG #07 H1 GCG #07 H1 ACG #16 H2

6 GUA #24 H3 ACU #17 H3 UAG #25 H3 UAG #25 H3 UAG #25 H3 CCG #10 H1

5 UAA #31 H4 AGU #17 H3 CUA #25 H3 CUA #25 H3 CUA #25 H3 CGG #10 H1

4 UUA #31 H4 UAC #24 H3 CAC #19 H2 CAC #19 H2 CAC #19 H2 UCG #13 H2

3 AUA #29 H4 GUA #24 H3 GUG #19 H2 GUG #19 H2 GUG #19 H2 CGA #13 H2

2 CUA #25 H3 CUA #25 H3 GGG #01 H1 GGG #01 H1 GGG #01 H1 CGC #07 H1

1 UAG #25 H3 UAG #25 H3 CCC #01 H1 CCC #01 H1 CCC #01 H1 GCG #07 H1

Height order of the 64 genomic codon distributions for domains

Order Virus Bacteria Crenarchaeota Euryarchaeota Archaea Eukaryote

H1: Hierarchy 1 H2: Hierarchy 2 H3: Hierarchy 3 H4: Hierarchy 4

Fig S3c1 The orders of the average distribution heights for different domains, where the values of genomiccodon distributions here are obtained based on genomic data in Fig 3B and Fig S3b1. The orders of theaverage distribution heights for Archaea and Eukarya are more similar with respect to Hierarchy 1 − 4 thanthat for Virus and Bacteria.

75


https://doi.org/10.1101/017988

64 GAG #04 H2 GGC #02 H1 UUA #31 H4 AUU #30 H4

63 GCG #07 H1 GCC #02 H1 UAA #31 H4 UUA #31 H4

62 CGG #10 H1 GCG #07 H1 AUU #30 H4 UAU #29 H4

61 GGC #02 H1 CCG #10 H1 AAU #30 H4 UUG #28 H3

60 AGG #11 H2 CGA #13 H2 AUA #29 H4 GUU #26 H3

59 CGA #13 H2 UGC #09 H2 UCA #15 H3 UAA #31 H4

58 GCC #02 H1 CGG #10 H1 CAU #22 H3 AAU #30 H4

57 CGC #07 H1 CAG #20 H2 AAC #26 H3 UGU #18 H3

56 GGG #01 H1 CCU #11 H2 AAG #27 H3 CUU #27 H3

55 CCG #10 H1 ACC #06 H2 GAA #23 H3 UUC #23 H3

54 ACG #16 H2 CGC #07 H1 UAU #29 H4 UCU #14 H3

53 AGC #08 H2 AGC #08 H2 CUU #27 H3 AUA #29 H4

52 GGA #03 H2 UGG #12 H2 AUC #21 H3 UUU #32 H4

51 GAC #05 H2 AGG #11 H2 AGU #17 H3 AGU #17 H3

50 CAG #20 H2 GCA #09 H2 UAG #25 H3 UGA #15 H3

49 GCA #09 H2 CGU #16 H2 UAC #24 H3 CUA #25 H3

48 AGA #14 H3 GAC #05 H2 CAA #28 H3 AUC #21 H3

47 GGU #06 H2 CUG #20 H2 GUU #26 H3 UAG #25 H3

46 GUG #19 H2 GGG #01 H1 UUC #23 H3 GAU #21 H3

45 AAG #27 H3 GCU #08 H2 ACU #17 H3 ACU #17 H3

44 CCC #01 H1 CAC #19 H2 GUA #24 H3 GUA #24 H3

43 GAA #23 H3 UCG #13 H2 UGA #15 H3 UCA #15 H3

42 GUC #05 H2 GUG #19 H2 UUG #28 H3 AUG #22 H3

41 CCA #12 H2 GGU #06 H2 AGA #14 H3 UAC #24 H3

40 CAC #19 H2 ACG #16 H2 GAU #21 H3 CAU #22 H3

39 ACC #06 H2 CCC #01 H1 ACA #18 H3 GAA #23 H3

38 UCG #13 H2 CCA #12 H2 UCU #14 H3 AAG #27 H3

37 CGU #16 H2 GUC #05 H2 CUA #25 H3 AGA #14 H3

36 UGC #09 H2 GGA #03 H2 AUG #22 H3 AAA #32 H4

35 UGG #12 H2 CUC #04 H2 AAA #32 H4 AAC #26 H3

34 AAC #26 H3 GAG #04 H2 CGA #13 H2 CAA #28 H3

33 CAA #28 H3 AGU #17 H3 ACG #16 H2 ACA #18 H3

32 GCU #08 H2 UAG #25 H3 UGU #18 H3 CUG #20 H2

31 ACA #18 H3 GAA #23 H3 UUU #32 H4 GUC #05 H2

30 GUA #24 H3 AAG #27 H3 CAG #20 H2 CCU #11 H2

29 CUG #20 H2 UUG #28 H3 GCA #09 H2 CGU #16 H2

28 AGU #17 H3 UCA #15 H3 CCA #12 H2 GGU #06 H2

27 UAG #25 H3 CAA #28 H3 GCU #08 H2 UCC #03 H2

26 CCU #11 H2 GUA #24 H3 AGG #11 H2 UGG #12 H2

25 UGA #15 H3 GUU #26 H3 UGC #09 H2 UGC #09 H2

24 UCC #03 H2 ACU #17 H3 AGC #08 H2 UCG #13 H2

23 AUG #22 H3 UCC #03 H2 ACC #06 H2 GCU #08 H2

22 CUC #04 H2 CUA #25 H3 CUC #04 H2 GUG #19 H2

21 GAU #21 H3 AGA #14 H3 CCU #11 H2 CUC #04 H2

20 CUA #25 H3 AAC #26 H3 CUG #20 H2 CCA #12 H2

19 UAC #24 H3 UAC #24 H3 CAC #19 H2 ACC #06 H2

18 UCA #15 H3 ACA #18 H3 GGU #06 H2 CAC #19 H2

17 AAA #32 H4 CUU #27 H3 UCG #13 H2 GGA #03 H2

16 CAU #22 H3 AUG #22 H3 CGU #16 H2 AGC #08 H2

15 AUC #21 H3 GAU #21 H3 GAC #05 H2 CGA #13 H2

14 ACU #17 H3 UUC #23 H3 UCC #03 H2 AGG #11 H2

13 UGU #18 H3 UGA #15 H3 GAG #04 H2 ACG #16 H2

12 UUG #28 H3 UGU #18 H3 GUC #05 H2 CAG #20 H2

11 UAA #31 H4 AAU #30 H4 GUG #19 H2 GAC #05 H2

10 GUU #26 H3 AUA #29 H4 UGG #12 H2 GCA #09 H2

9 AAU #30 H4 CAU #22 H3 GGA #03 H2 GAG #04 H2

8 UCU #14 H3 UAU #29 H4 GCC #02 H1 GCC #02 H1

7 AUA #29 H4 AUC #21 H3 GGC #02 H1 CCG #10 H1

6 CUU #27 H3 AUU #30 H4 CGC #07 H1 CGC #07 H1

5 UAU #29 H4 UUA #31 H4 CCG #10 H1 CGG #10 H1

4 UUC #23 H3 UAA #31 H4 GCG #07 H1 GGC #02 H1

3 UUA #31 H4 UCU #14 H3 CGG #10 H1 GCG #07 H1

2 AUU #30 H4 UUU #32 H4 CCC #01 H1 GGG #01 H1

1 UUU #32 H4 AAA #32 H4 GGG #01 H1 CCC #01 H1

Simulation of height order of the 64 genomic codon distributions for domains

Order simulaton for virus simulaton for bacteria simulaton for archaea simulaton for eukaryote

H1: Hierarchy 1 H2: Hierarchy 2 H3: Hierarchy 3 H4: Hierarchy 4

Fig S3c2 Simulations of the orders of the average distribution heights for different domains, where the valuesof genomic codon distributions here are obtained from the simulation results in Fig S3b3. These simulationresults for the domains agree with the observations in Fig S3c1 based on genomic data, respectively.

76


https://doi.org/10.1101/017988

−0.2−0.15

−0.1−0.05

00.05

0

1

2

3

Flu

ctu

ation

−0.06−0.04

Route bias

−0.020

0.02Hierarchy bias

E

AE

AC

BPa

BPb

BPc

BPd

BPe

BF

BAct

BCya

BChl

BSpi

BVer

BThe

Bother

VdD

VsR

VdRm

VsD

VdDB

Vsat

Fig 3D Classification of Virus (yellow), Bacteria (blue), Archaea (red) and Eukarya (green) in theevolution space, which is based on the roadmap and the complete genome sequences. Refer to FigS3d3, S3d4, S3d5 to see the two dimensional projections of this figure. The abbreviations for thetaxa are as follows: Eukarya (E), Euryarchaeota (AE), Crenarchaeota (AC), Alphaproteobacteria (BPa),Betaproteobacteria (BPb), Gammaproteobacteria (BPc), Deltaproteobacteria (BPd), Epsilonproteobacteria(BPe), Firmicutes (BF), Actinobacteria (BAct), Cyanobacteria (BCya), Chlorobi (BChl), Spirochaetes(BSpi), Chlamydiae/Verrucomicrobia (BVer), Thermotogae (BThe), other bacteria (Bother), dsDNA virus(VdD), ssRNA virus (VsR), Mammalian orthoreovirus (VdRm), ssDNA virus (VsD), Blainvillea yellow spotvirus (VdDB), satellite virus (Vsat).

77


https://doi.org/10.1101/017988

20 combinations 64 permutations

Typ

es o

f gen

om

ic c

od

on

dis

tribu

tion

s

<G>

<C>

<A>

<T>

Ro

ute

0R

ou

te 1

~3

Ro

ute

0R

ou

te 1

~3

Ro

ute

0R

ou

te 1

~3

Ro

ute

0R

ou

te 1

~3

R s

tran

dY

stra

nd

R s

tran

dY

stra

nd

Hie

rarc

hy 1

~2

Hie

rarc

hy 3

~4

<G,G,G>

<G,G,A>

<G,G,C>

<G,G,T>

<G,C,A>

<C,C,C>

<C,C,T>

<C,C,G>

<C,C,A>

<G,C,T>

<A,A,A>

<A,A,G>

<A,A,T>

<A,A,C>

<A,T,G>

<T,T,T>

<T,T,C>

<T,T,A>

<T,T,G>

<A,T,C>

t1

t3, t10

t2

t12

t4, t9, t14

t7

t6

t13

t5, t11

t20

t19

t16

t17

t8, t15, t18

GGG

GGA

GGC

GGT

AGC

GAC

GAG

GCG

GTG

GCA

ACG

AGG

CGG

TGG

CAG

CGA

CCC

TCC

GCC

ACC

GCT

GTC

CTC

CGC

CAC

TGC

CGT

CCT

CCG

CCA

CTG

TCG

AAA

AAG

AAT

AAC

GAT

AGT

AGA

ATA

ACA

ATG

GTA

GAA

TAA

CAA

TGA

TAG

TTT

CTT

ATT

GTT

ATC

ACT

TCT

TAT

TGT

CAT

TAC

TTC

TTA

TTG

TCA

CTA

Fig S3d1 The 20 types of genomic codon distributions. There are respectively 20 combinations and 64permutations of the 4 bases. There is a coarse correlation between the 20 combinations of codons and the 20amino acids according to the tRNAs t1 to t20 (Fig S2a1).

78


https://doi.org/10.1101/017988

#11

G

G

A

C

C

T

#14

A

G

A

T

C

T

#1

G

G

G

C

C

C

#3

A

G

G

T

C

C

#27

G

A

A

C

T

T

#32

A

A

A

T

T

T

#4

G

A

G

C

T

C

#23

A

A

G

T

T

C

#5

C

A

G

G

T

C

#26

C

A

A

G

T

T

#2

C

G

G

G

C

C

#8

C

G

A

G

C

T

#21

T

A

G

A

T

C

#30

T

A

A

A

T

T

#6

T

G

G

A

C

C

#17

T

G

A

A

C

T

#16

G

C

A

C

G

T

#18

A

C

A

T

G

T

#7

G

C

G

C

G

C

#9

A

C

G

T

G

C

#22

G

T

A

C

A

T

#29

A

T

A

T

A

T

#19

G

T

G

C

A

C

#24

A

T

G

T

A

C

#20

G

A

C

C

T

G

#28

A

A

C

T

T

G

#10

G

G

C

C

C

G

#13

A

G

C

T

C

G

#25

G

A

T

C

T

A

#31

A

A

T

T

T

A

#12

G

G

T

C

C

A

#15

A

G

T

T

C

A


Ro

ute

0

Ro

ute

1

Ro

ute

2

Ro

ute

3

Ro

ute

bia

sfo

r gen

om

ic c

od

on

dis

tribu

tion

s

Hierarchy bias

for genomic codon distributions

Fig S3d2 Explanation of the definitions of hierarchy bias and route bias. The hierarchy bias is definedby the difference between the averages of genomic codon distributions for Hierarchy 1 − 2 and that forHierarchy 3 − 4. And the route bias is defined by the difference between the averages of genomic codondistributions for Route 0 and that for Route 1 − 3. By the way, the leaf node bias is defined by the differencebetween the averages of genomic codon distributions for the leaf nodes in Route 0 − 1 and that for the leafnodes in Route 2 − 3.

79


https://doi.org/10.1101/017988

−0.2 −0.15 −0.1 −0.05 0 0.05−0.08

−0.06

−0.04

−0.02

0

0.02

Hierarchy bias

Route

bia

s

E

AE

AC

BPa

BPb

BPc

BPd

BPe

BF

BAct

BCya

BChl

BSpi

BVer

BThe

Bother

VdD

VsR

VdRm

VsD

VdDB

Vsat

Fig S3d3 A two dimensional projection of the three dimensional evolution space (Fig 3D): the Hierarchybias and Route bias plane. Bacteria (blue) and Archaea (red) are distinguished by and large in the presentfigure.

80


https://doi.org/10.1101/017988

−0.2 −0.15 −0.1 −0.05 0 0.05

0

1

2

3

Hierarchy bias

Flu

ctu

ation

E

AE

AC

BPa

BPb

BPc

BPd

BPe

BF

BAct

BCya

BChl

BSpi

BVer

BThe

Bother

VdD

VsR

VdRm

VsD

VdDB

Vsat

Fig S3d4 A two dimensional projection of the three dimensional evolution space (Fig 3D): the Hierarchy biasand Genomic codon distribution Fluctuation plane. Eukarya (green), Bacteria/Archaea, and Virus (yellow)are distinguished by and large in the present figure.

81


https://doi.org/10.1101/017988

−0.08−0.06−0.04−0.0200.02

0

1

2

3

Route bias

Flu

ctu

ation

E

AE

AC

BPa

BPb

BPc

BPd

BPe

BF

BAct

BCya

BChl

BSpi

BVer

BThe

Bother

VdD

VsR

VdRm

VsD

VdDB

Vsat

Fig S3d5 A two dimensional projection of the three dimensional evolution space (Fig 3D): the Route biasand Fluctuation plane. Eukarya (green), Bacteria (blue), Archaea (red), and Virus (yellow) are distinguishedby and large in the present figure.

82


https://doi.org/10.1101/017988

−0.06 −0.04 −0.02 0

−0.08

−0.06

−0.04

−0.02

0

0.02

Leaf node bias

Route

bia

s

E

AE

AC

BPa

BPb

BPc

BPd

BPe

BF

BAct

BCya

BChl

BSpi

BVer

BThe

Bother

VdD

VsR

VdRm

VsD

VdDB

Vsat

Fig S3d6 Distinguish Crenarchaeota (AC) and Euryarchaeota (AE) in the Leaf node bias and Route biasplane.

83


https://doi.org/10.1101/017988

−0.06 −0.04 −0.02 0

0

1

2

3

Leaf node bias

Flu

ctu

ation

E

AE

AC

BPa

BPb

BPc

BPd

BPe

BF

BAct

BCya

BChl

BSpi

BVer

BThe

Bother

VdD

VsR

VdRm

VsD

VdDB

Vsat

Fig S3d7 Distinguish Crenarchaeota (AC) and Euryarchaeota (AE) in the Leaf node bias and Fluctuationplane.

84


https://doi.org/10.1101/017988

001BCyanoB060BF

426BCyanoB069Z

070BChloro071BChloro

032BCyanoB482Z

481BCyanoB637BPc

220Z222BPc224BPc221BPc

223Z048BF

513BChloro

514BChloro062BF

063Z

182BChloro036BPa

242BPd

407AEuryA

235BChlofl

038BF

756BActinB

757BActinB

233BChlofl

234BChlofl

713BPe

700BF724Z078Z

720BCyanoB140Z

282BF803Z324Z

181BChloro

528BChloro

300BF

183BChloro

501BChloro

301BF

474BPc

518Z

522BCyanoB

715BCyanoB

061Z

065Z

064BF

072BChloro

243BF

212BF

230BCyanoB

229BCyanoB

163BF

411Z

415AEuryA

726Z

240BF

241BF

492BChloro

367BPc

369BPc

368BPc

370Z

818E15

258BOther

010BPc

014BPc

530Z

011BPc

012BPc

013BPc

496Z

027BPc

015BPc

314Z

316BPc

315BPc

317BPc

313BPc

361Z

211BPc

142BPe

356BF

261BF

319BPc

380Z

766BPc

767BPc

160BPc

231BChloro

486BF

379BF

002BF

446BF

459BF

760BF

761Z

762BOther

037BPa

285BChloro

386BF

049BF

051Z

050Z

059Z

052BF

053BF

067Z

066BF

055BF

068BF

054Z

058BF

056BF

172BV

erruc057B

F

228BCyanoB

583BPa

589BPa

587Z

592BPa

591BPa

588BPa

470BPa

681BF

684BF

682BF

685BF

683BF

686BF

471BPe

759BO

ther

137BO

ther516Z

462BF

519BC

yanoB493Z

525BC

yanoB

526BC

yanoB202B

F

283BT

hermo

167Z168Z

170BV

erruc

169BV

erruc

171BV

erruc

175BV

erruc

177BV

erruc

178BV

erruc

176BV

erruc752B

Spir

173Z

174BV

erruc584B

Pa

585BPa

484Z776B

Pa

208BF

376Z377Z

311BC

hloro

743BT

hermo

372Z373Z

755BC

yanoB

399AE

uryA044B

F362B

F357B

F403A

Eury

A404A

Eury

A402A

Eury

A405A

Eury

A586B

Pa

110B

Pc

114Z

113B

Pc

143Z

144B

Pe

145B

Pe

158B

F451B

F452B

F453B

F146B

Pe

147B

Pe

366B

Pd

488Z

284B

F590Z

593Z

148B

Chlo

ro503B

Therm

o149B

Chlo

ro775B

Pa

774Z

327Z

337B

Aquifi

406A

Eury

A712B

Aquifi

073B

Pa

328B

Pe

333B

Pe

329B

Pe

332B

Pe

330B

Pe

331B

Pe

141Z

363Z

326B

Pe

703B

F704B

F705B

F364B

F365B

F318B

Pc

381Z

382Z

383B

F384B

F359Z

360Z

497B

F687B

F698B

F694B

F693B

F697B

F689B

F699B

F6

96

BF

68

8B

F6

90

BF

69

5B

F6

92

BF

69

1B

F385B

F680Z

074B

Pa

075B

Pa

076B

Pa

701B

F702B

F678B

F679Z

139B

Pe

457Z

077Z

162B

Pc

447B

F449Z

151B

Pc

675Z

676B

F677B

F355B

F286B

Chlo

ro654Z

659B

F660B

F661B

F666B

F667B

F658B

F655Z

662B

F657B

F664B

F656B

F663B

F665B

F671Z

348B

F751Z

287Z

293Z

670Z

288Z

295Z

292B

Pc

294Z

289B

Pc

290Z

291B

Pc

668B

F669B

F029B

F030B

F738Z

203B

F210B

F413Z

374B

Spir

375B

Spir

412Z

138A

Cre

nA

730Z

507A

Eury

A729B

F73

1BF

248A

Cre

nA

737Z

416B

Oth

er

397Z

744Z

745B

The

rmo

746B

The

rmo

039Z

155A

Oth

erA

396A

Cre

nA

557Z

559Z

558Z

041B

Pe

450Z

157B

Pa

109Z

159B

Oth

er

414Z

398A

Eur

yA

477A

Cre

nA

112Z

150B

Pc

393B

F44

8Z25

3BPa

254B

Pa

255B

Pa25

6Z25

7Z49

0BPa

491B

Pa09

0BSp

ir

094B

Spir

095B

Spir

092B

Spir

097B

Spir

091Z

111B

Pc

204B

F29

9Z46

1Z09

3BSp

ir

096B

Spir

773Z

192Z

194Z

196B

F19

7BF

195B

F19

8BF

201B

F

205Z

207B

F

400A

EuryA

735B

Oth

er

206B

F

209B

F

199B

F

200B

F

740B

Therm

o

741BTherm

o

515BCyanoB

521BCyanoB

524BCyanoB

517Z

520BCyanoB

523BCyanoB

193BF

672ACrenA

250BOther

251BOther

401Z

709ACrenA

710Z

711Z

152BPc

161BChloro

454BF

458Z

455Z

456Z

023Z

554ACrenA

409Z

555ACrenA

739ACrenA

736ACrenA

410AEuryA

553Z

556ACrenA

777Z

040Z

733Z

734AEuryA

341ACrenA

804E1

808E5

809E6

805E2

819E16

807E4

806E3

812E9

816E13

813E10

817E14

814E11

815E12

810E7

811E8

003BPa

043BActinB

420BPa

421BPa

489BOther

483BPa

653BPa

264BPa

425BPc

582BPa

645BPa

046Z

578BPa

580BPa

577BPa

575BPa

576BPa

098BPa

100BPa

308Z309BPa

099Z245BPd

246BPd

390BPa

395Z487BPa

008BPb

045BPb

122BPb

123BPb

124BPb

125BPb

579BPa

219BActinB

534BPc

535BPc

536BPc749BPb533Z214BActinB

238BPb009BPb540BPc560BPb086Z087Z564Z028BPc325BPc088Z225BPb187Z247Z388BPa296BActinB297BActinB298BActinB601BActinB602BActinB573BActinB378Z429BActinB430Z433BActinB445BActinB434BActinB435BActinB438BActinB439BActinB673BPc674BPc021Z080BActinB084BActinB082Z083BActinB081BActinB218BActinB085BPb232BPb128BPb129BPb136BPb307Z

321AEuryA509BPb510BPb539BPc543BPc544Z541BPc542BPc089Z538BPc130BPb132BPb134BPb131BPb133BPb463Z126Z127Z545BPc561BPb763BPb562BPb563BPb115BPb116BPb117BPb118BPb121Z119BPb120BPb135BPb188BPc537BPc428Z431Z432BActinB442BActinB

441BActinB440BActinB

443BActinB

437BActinB

784BPc785BPc

786BPc779BPc

780BPc781BPc

782BPc783Z444BActinB

322AEuryA

323Z004BPc005BOther

244Z391BPc596BPa

722Z716BCyanoB

485BPa312Z472BPa

527BActinB

721BCyanoB

594BChlofl

595BChlofl

022Z179BChloro

184BChloro

464BPb465BPb

466BPb467BPb

469BPb468BPb

215Z216Z217BActinB

303BPd

742Z320BPc

418BPb

498Z352Z353BF006BA

cidoB

101Z102BPa

106Z105Z

108BPa

103BPa

104Z107B

Pa

153Z

499BPd

334BF

758AEuryA

024Z

650BA

cidoB

568BPa

342BPa

042BA

ctinB

310BPa

473BPa

723BC

yanoB

646BPa

154BF

226Z

500BC

hloro

339BPa

566BPa

567BPa

569BPa

424BPa

025BO

ther

502BF

156AEuryA

529Z

239BPd

427BF

227Z

306Z

753Z

754BSpir

338Z

727BPd

079Z

478BPb

302BPd

304BPd

305Z

408AE

uryA

394BPa

647BPa

725BPd

495BPa

648Z

007BA

ctinB

047BPa

236BD

einoc

460BPd

581BPa

651Z

652BP

a

714Z

164Z

165B

Pa

252B

Pa

494B

Pa

237Z

778B

Pa

600B

Chlo

ro

371Z

422B

Pa

732B

Actin

B

419B

Pa

190Z

191B

Actin

B

347B

Actin

B

707Z

708Z

480B

Actin

B

417B

Pb

479Z

423B

Pa

599B

Actin

B

706Z

504B

Pa

570Z

571B

Pa

572B

Pa

597B

Actin

B

016B

Pc

392B

Pc

505B

Pc

770B

Pc

552B

Pc

768B

Pc

031B

Pc

627B

Pc

632B

Pc

625B

Pc

634B

Pc

636B

Pc

531B

Pc

633B

Pc

769Z

511B

Pb

512Z

750B

Pc

186B

Chlo

ro

340B

Pc

598B

Pc

620B

Pc

630B

Pc

629B

Pc

764Z

765B

Pc

349B

F350Z

351B

F621B

Pc

622B

Pc

623B

Pc

624B

Pc

631Z

771B

Pc

772B

Pc

791Z

79

2B

Pc

79

7B

Pc

79

3B

Pc

79

4B

Pc

79

5B

Pc

79

6B

Pc

798B

Pc

800B

Pc

799B

Pc

801B

Pc

802B

Pc

017B

Pc

018B

Pc

019B

Pc

389B

Pc

532B

Pc

551B

Pc

020B

Pc

358Z

626B

Pc

549Z

550B

Pc

475Z

476Z

506Z

635B

Pc

026B

Pc

166B

Pc

619B

Pc

213Z

387B

Oth

er

180B

Chlo

ro

354B

F

628B

Pc

189B

Pc

259B

Pc

263Z

335B

Pb

343B

Pb

649B

Pc

262Z

265Z

267B

Pc

268B

Pc

266B

Pc

279B

Pc

270B

Pc

281B

Pc

269Z

271B

Pc

273Z

272Z

280Z

278B

Pc

277B

Pc

274Z

275B

Pc

276B

Pc

641Z

642B

Pc

643B

Pc

603Z

604Z

614B

Pc

607B

Pc

617Z

609B

Pc

611B

Pc

605Z

606B

Pc

612B

Pc

608B

Pc

613B

Pc

615B

Pc

616Z

610B

Pc

639B

Pc

644B

Pc

638B

Pc

618B

Pc

640Z

787Z

788B

Pc

789B

Pc

790B

Pc

336B

Chl

ofl

260B

Pc

345B

Pc

346Z546Z548Z547Z

574B

Pb

185B

Chl

ofl

565B

Act

inB

717B

Cya

noB

718B

Cya

noB

719B

Cya

noB

249B

Pc

436Z508Z

873V

dD

909V

dD

931V

dD

728B

Pb

875V

dD

919V

dD

825V

dD

866V

dD

895V

dD

896V

dD

950V

dD

951V

dD

827V

dD

821V

dD

823V

dD

822V

dD

824V

dD

826V

dD

901V

dD

929V

dD

925V

dD

927V

UP

834V

UP

913V

dD

936V

sR

928V

dD

940V

dD

935V

sR

937V

sR

902V

dD

941VdD

946VdD

908VdD

033BPd

035BPd

344BAct

inB

034BPd

887VdD889VdD

890VdD891VdD

888VdD881VdD

918VdD920VdD

921VdD900VdD

907VdD882VdD

884VdD915VdD

917VdD916VdD

867VdD870VdD

869VdD871VdD

868VdD906VdD

938VdD883VdD

874VdD914VdD

922VdD

747BDeinoc

748BDeinoc820VUC

880VsRe

897VUC855VdRm

832VsDm

876VsD905VsD

o939VUC933VsR

828VdDB830VsDc

863VdRm947VsR833VsD948VdRr837VdRm949VdRr

904VsDo911VsDcl930VsD835VUC952VSat893VsD899VsD846VdRm848VdRm894VRT924VdD945VsR903VsD923VsD926VdD934VSat829VdDB845VdRm864VdRm912VsDcl836VdRm856VdRm857VdRm847VdRm839VdRm862VdRm838VdRm842VdRm851VdRm859VdRm843VdRm849VdRm942VsD943VsD944VsR840VdRm854VdRm853VdRm860VdRm852VdRm861VdRm841VdRm858VdRm850VdRm

885VsRb844VdRm865VdRm886VsRb831VsD932VsR

878VsRe879VsRe898VUC892VsR872VUP910VdD877VUV

Fig S3d8 Tree of species based on the distances among species in the nondimentionalized evolution spaceFig 3D. This tree agrees with the three-domain tree by and large. Please enlarge to see details. Refer to thelegend of Fig 3D in the appendix to see the abbreviations of the taxa.

85


https://doi.org/10.1101/017988

1. Virus2. Bacteria

4. Eukarya

3. Crenarchaeota

3. Euryarchaeota

Fig S3d9 Tree of Bacteria, Crenarchaeota, Euryarchaeota, Eukarya and Virus based on the average distancesamong these taxa in the nondimentionalized evolution space Fig 3D. The origin order of these taxa is obtainedbased on the result in Fig 3C. This tree agrees with the three-domain tree rather than the eocyte tree.

86


https://doi.org/10.1101/017988

Bee

Dog

Zebrafish

Thalecress

Beetle

Nematode

Cow

Human

Chimpanzee

M. Mulatta

Rat

Opossum

Mouse

Horse

Chicken

Cerevisiae

Fig S3d10 Tree of eukaryotes based on their distances in the nondimentionalized evolution space Fig 3D.This tree is considerably reasonable.

87


https://doi.org/10.1101/017988

600 550 500 450 400 350 300 250 200 150 100 50 0

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

Logarith

m o

f num

ber

of genera

Cm O S D C P Tr J K Pg NEdiacaran

Time (Ma)

Trend of genome size evolution: Ngenome

(t) = exp( − t ⋅ log2 / + 21.5 )

Trend of biodiversity evolution: Ngenus

(t) = exp( − t ⋅ log2 / + 8 )

177.8

172.7

≈

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

Origin

al lo

garith

mic

mean o

f genom

e s

ize G

orilo

gM

ean

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

Fig 4A Explanation of the trend of the Phanerozoic biodiversity curve by the trend of genome size evolution.The exponential growth rate of the Phanerozoic biodiversity curve is about equal to the exponential growthrate of genome size evolution (indicated in red). Refer to Fig S4a7 and Fig S4a8 to see the exponential trendof the genome size evolution and the exponential trend of the Phanerozoic biodiversity, respectively.

88


https://doi.org/10.1101/017988

0

10

20

30

40

50

60

70

80

genome size (base pairs)

Nu

mb

er

of

sp

ecie

s

107

108

109

1010

1011

GlogMean

Database←

Angiosperm

Gymnosperm

Pteridophyte

Bryophyte

Deuterostomia

Protostomia

Diploblostica

Fig S4a1 Logarithmic normal distribution of genome sizes for taxa. The genome size distribution for allthe species in the two databases satisfies logarithmic normal distribution very well, due to the additivity ofnormal distributions. The logarithmic mean of genome sizes for all the species in the two databases is markedin the present figure, which represents the intercept of the trend of genome size evolution in Fig S4a7.

89


https://doi.org/10.1101/017988

time t

logarithm of genome size

Simulation of genome size evolution: Taxon A

tA

origin

tA

GA

orilogMeanG

A

logMean ± G

A

logSD

Ancestor ofTaxon A Taxon A

Simulation of genome size evolution: Taxon B

tB

origin < t

A

origin

tB

GB

orilogMeanG

B

logMean ± G

B

logSD

GA

orilogMeanG

A

logMean G

A

logSD

< = >

Ancestor ofTaxon B Taxon B

Fig S4a2 Simulation of the evolution of genome size. In the simulation, the genome size statistically doublesin the stochastic process model, which results in the logarithmic genome size distributions. The invarianceof the genomic codon distributions after genome duplication is considered in the model (Fig S3a4). Twotaxa A and B are compared in the simulation. The greater the logarithmic standard deviation of genome sizesin a taxon is (GB

logS D > GAlogS D), the earlier the taxon originated (tB

origin <tAorigin). Besides, the less the original

logarithmic mean of genome size is (GBorilogMean < GA

orilogMean), the earlier the taxon originated (tBorigin <tA

origin).

90


https://doi.org/10.1101/017988

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6

16

17

18

19

20

21

22

23

24

Lo

ga

rith

mic

me

an

of

ge

no

me

siz

e G

logM

ean

Logarithmic standard deviation of genome size GlogSD

GlogMean

for taxa

GorilogMean

for taxa

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6

16

17

18

19

20

21

22

23

24

Orig

ina

l lo

ga

rith

mic

me

an

of

ge

no

me

siz

e G

orilo

gM

ean

Fig S4a3 The greater the logarithmic standard deviation of genome sizes GlogS D is, the less the originallogarithmic mean of genome sizes GorilogMean is, and meanwhile also the less the logarithmic mean of genomesizes GlogMean is. These statistical analyses of genome sizes for the taxa are obtained based on the genomicdata in the two databases for animals and plants.

91


https://doi.org/10.1101/017988

13 14 15 16 17 18 19 20 21 22 23 24 25 26 2713

14

15

16

17

18

19

20

21

22

23

24

25

26

27

Archeae

Bacteria

Algae&Protists

Diploblostica

Protostomia

Deuterostomia

Bryophyte

Pteridophyte

Gymnosperm

Angiosperm

GlogMean

± χ * GlogSD

for taxa

GlogMean

± χ * GlogSD

for lower taxa

minimum through maximum

predicted upper limit of logarithmic genome size

Original logarithmic mean of genome size GorilogMean

Lo

ga

rith

mic

va

ria

tio

n o

f g

en

om

e s

ize

Fig S4a4 The less the original logarithmic mean of genome sizes GorilogMean is, the less the logarithmic meanof genome sizes GlogMean is (red), and meanwhile the greater the logarithmic standard deviation of genomesizes GlogS D is (green or blue). The original logarithmic mean of genome sizes GorilogMean indicate the origintime of taxon (Fig S4a2). The convergent point (namely the upper vertex of the triangle area in the presentfigure) indicates the upper limit of logarithmic genome sizes at present.

92


https://doi.org/10.1101/017988

No. Superphylum Taxon number of records GlogMean

GlogSD

GorilogMean

1 Protostomia Nematodes 66 −2.394 0.93204 −3.8552

2 Deuterostomia Chordates 5 −1.8885 0.91958 −3.3301

3 Diploblostica Sponges 7 −1.0834 1.3675 −3.2272

4 Diploblostica Ctenophores 2 −0.010305 1.6417 −2.584

5 Protostomia Tardigrades 21 −1.2168 0.7276 −2.3574

6 Protostomia Misc_Invertebrate 57 −0.75852 0.96321 −2.2686

7 Protostomia Arthropod 1284 −0.078413 1.2116 −1.9778

8 Protostomia Annelid 140 −0.14875 0.9258 −1.6001

9 Protostomia Myriapods 15 −0.54874 0.66478 −1.5909

10 Protostomia Flatworms 68 0.15556 1.0701 −1.522

11 Protostomia Rotifers 9 −0.51158 0.55134 −1.3759

12 Diploblostica Cnidarians 11 −0.16888 0.69379 −1.2565

13 Deuterostomia Fish 2045 0.23067 0.6559 −0.7976

14 Deuterostomia Echinoderm 48 0.11223 0.52794 −0.71542

15 Protostomia Molluscs 263 0.5812 0.5493 −0.27994

16 Deuterostomia Bird 474 0.32019 0.13788 0.10403

17 Deuterostomia Reptile 418 0.78332 0.28332 0.33916

18 Deuterostomia Amphibian 927 2.4116 1.081 0.71691

19 Deuterostomia Mammal 657 1.1837 0.2401 0.80727

Fig S4a5 The three animal superphyla Diploblostica, Protostomia and Deuterostomia have been roughlydistinguished based on the order of the original logarithmic means of genome sizes GorilogMean for the animaltaxa (Fig S4a4). This supports that the order of GorilogMean indicate the evolutionary chronology.

93


https://doi.org/10.1101/017988

No. Class Taxon GlogMean

GlogSD

GorilogMean

1 Dicotyledoneae Lentibulariaceae −1.0532 0.88349 −2.43822 Monocotyledoneae Cyperaceae −0.81211 0.61307 −1.77323 Dicotyledoneae Cruciferae −0.62192 0.6855 −1.69664 Dicotyledoneae Rutaceae −0.22121 0.93413 −1.68565 Dicotyledoneae Oxalidaceae 0.19774 1.1445 −1.59646 Dicotyledoneae Crassulaceae −0.26578 0.82268 −1.55557 Dicotyledoneae Rosaceae −0.40468 0.62511 −1.38478 Dicotyledoneae Boraginaceae −0.20664 0.68081 −1.27399 Dicotyledoneae Labiatae −0.0021905 0.80883 −1.270210 Monocotyledoneae Juncaceae −0.23032 0.63698 −1.228911 Dicotyledoneae Vitaceae −0.60987 0.39049 −1.22212 Dicotyledoneae Cucurbitaceae −0.26487 0.60779 −1.217713 Dicotyledoneae Onagraceae 0.040848 0.78018 −1.182214 Dicotyledoneae Leguminosae 0.33968 0.88684 −1.050615 Dicotyledoneae Myrtaceae −0.37801 0.42511 −1.044516 Monocotyledoneae Bromeliaceae −0.56838 0.29232 −1.026617 Dicotyledoneae Polygonaceae 0.20985 0.76174 −0.9843318 Dicotyledoneae Euphorbiaceae 0.72687 1.0796 −0.9656119 Dicotyledoneae Convolvulaceae 0.50052 0.928 −0.954320 Dicotyledoneae Chenopodiaceae −0.046809 0.5526 −0.9131221 Dicotyledoneae Plantaginaceae −0.15021 0.48422 −0.9093222 Dicotyledoneae Rubiaceae −0.084413 0.51565 −0.892823 Dicotyledoneae Caryophyllaceae 0.27683 0.65869 −0.755824 Dicotyledoneae Amaranthaceae 0.15834 0.58176 −0.7536925 Dicotyledoneae Malvaceae 0.39517 0.47109 −0.3433626 Monocotyledoneae Zingiberaceae 0.24819 0.36317 −0.3211527 Monocotyledoneae Iridaceae 1.3429 1.0491 −0.3017828 Dicotyledoneae Umbelliferae 0.71003 0.6235 −0.2674229 Dicotyledoneae Solanaceae 0.78034 0.66585 −0.2635230 Monocotyledoneae Orchidaceae 1.4063 1.0551 −0.2478431 Monocotyledoneae Araceae 1.5174 1.012 −0.06915232 Dicotyledoneae Papaveraceae 0.93206 0.61932 −0.03885433 Dicotyledoneae Compositae 1.0741 0.70726 −0.03465734 Monocotyledoneae Gramineae 1.4002 0.84476 0.07589435 Dicotyledoneae Cactaceae 0.98813 0.57251 0.09060836 Monocotyledoneae Palmae 1.1222 0.63488 0.1269137 Dicotyledoneae Passifloraceae 0.52209 0.22472 0.1697938 Dicotyledoneae Orobanchaceae 1.12 0.54393 0.2672639 Dicotyledoneae Cistaceae 0.88905 0.30458 0.4115540 Monocotyledoneae Asparagaceae 2.0053 0.78802 0.7699141 Dicotyledoneae Asteraceae 1.8795 0.67031 0.8286342 Dicotyledoneae Ranunculaceae 2.0285 0.72517 0.891643 Monocotyledoneae Agavaceae 1.6207 0.4537 0.9094144 Monocotyledoneae Hyacinthaceae 2.3635 0.69028 1.281445 Dicotyledoneae Loranthaceae 2.3797 0.68478 1.306246 Monocotyledoneae Commelinaceae 2.5322 0.64196 1.525847 Monocotyledoneae Amaryllidaceae 2.9085 0.5811 1.997548 Monocotyledoneae Xanthorrhoeaceae 2.7036 0.4 2.076549 Monocotyledoneae Asphodelaceae 2.8054 0.33968 2.272950 Monocotyledoneae Alliaceae 2.9051 0.39078 2.292451 Dicotyledoneae Paeoniaceae 2.957 0.28164 2.515552 Monocotyledoneae Liliaceae 3.5678 0.63278 2.575753 Monocotyledoneae Aloaceae 2.9724 0.2203 2.627

Fig S4a6 The two angiosperm classes Dicotyledoneae and Monocotyledoneae have been roughlydistinguished based on the order of the original logarithmic means of genome sizes GorilogMean for theangiosperm taxa (Fig S4a4). This also supports that the order of GorilogMean indicate the evolutionarychronology.

94


https://doi.org/10.1101/017988

600 550 500 450 400 350 300 250 200 150 100 50 0

19

20

21

22

23

Lo

ga

rith

mic

me

an

of

ge

no

me

siz

e G

logM

ean

Diploblostica

Protostomia

Deuterostomia

Bryophyte

Pteridophyte

Gymnosperm

Angiosperm

Trend of GlogMean

Trend of GorilogMean

GorilogMean

= GlogMean

− χ ⋅ GlogSD

GlogMean

Database→such that

Cm O S D C P Tr J K Pg NEdiacaran

Time (Ma)

19

20

21

22

23

Orig

ina

l lo

ga

rith

mic

me

an

of

ge

no

me

siz

e G

orilo

gM

ean

19

20

21

22

23

19

20

21

22

23

19

20

21

22

23

19

20

21

22

23

19

20

21

22

23

19

20

21

22

23

Fig S4a7 The exponential trend of genome size evolution based on the relation between the origin time oftaxa and their original logarithmic means of genome sizes. The red regression line of GlogMean against theorigin time should be shifted downwards to the blue regression line of GorilogMean against the origin time,by considering the inverse relationship between GlogS D and the origin time (Fig 4a2). The parameter χ inthe definition of GorilogMean is determined by letting the intercept of the blue regression line be the averagelogarithmic genome size GDatabase

logMean (Fig 4a1).

95


https://doi.org/10.1101/017988

500 450 400 350 300 250 200 150 100 50 0

−2

−1

0

1

2

3

4

5

6

7

8

9

Time (Ma)

Lo

ga

rith

m o

f n

um

be

r o

f g

en

era

Cm O S D C P Tr J K Pg N

Phanerozoic biodiversity curve = Fluctuation of biodiversity × Trend of biodiversity

Fluctuation of biodiversity

Trend of biodiversity: Ngenus

(t) = exp( − t ⋅ log2 / 172.7 + 8 )

500 450 400 350 300 250 200 150 100 50 0

1

10

100

1000

10000

Nu

mb

er

of

ge

ne

ra

Fig S4a8 The exponential trend of Bambach et al.’s Phanerozoic biodiversity curve. The unnormalisedbiodiversity net fluctuation curve (thin light green) is obtained by subtracting the linear trend of logarithmicbiodiversity curve (thin dark green) from the logarithmic biodiversity curve (thick dark green).

96


https://doi.org/10.1101/017988

500 450 400 350 300 250 200 150 100 50 0

−3

−2

−1

0

1

2

3

Time (Ma)

Rela

tive v

ariation


O−

S

F−

F

P−

Tr

Tr−

J

K−

Pg

G−

L

Biodiversity

Climato−eustatic curve

Fig 4B Explanation of the fluctuations of the Phanerozoic biodiversity curve based on climate change andsea level fluctuations, which indicates the tectonic cause of the mass extinctions. Refer to Fig S4b4 and FigS4a8 to see the climato-eustatic curve and the biodiversity net fluctuation curve, respectively.

97


https://doi.org/10.1101/017988

500 450 400 350 300 250 200 150 100 50 0


Time (Ma)

Clim

atic g

rad

ien

t b

ase

d o

n c

lima

te in

dic

ato

rsLow

Mod

Mod H

iH

iV

ery

Hi

Warm

Cool w

inte

rsF

reezin

g w

inte

rsU

nip

ola

r gla

cia

tion

Bip

ola

r gla

cia

tion

Te

mp

era

ture

Fig S4b1 Climatic gradient curve in the Phanerozoic eon based on climate indicators. The climatic gradientcurve is opposite to the climate curve.

98


https://doi.org/10.1101/017988

500 450 400 350 300 250 200 150 100 50 0

−4

−3

−2

−1

0

1

2

3

4

Time (Ma)

Re

lative

te

mp

era

ture

average Climate change

Cm O S D C P Tr J K Pg N4

3

2

1

0

−1

−2

−3

−4

Re

lativ

e c

lima

te g

rad

ien

t

Climate indicators

CO2

87Sr/

86Sr

δ18

O

Fig S4b2 The Phanerozoic climate curve is obtained by averaging the four independent results (climateindicators, Berner’s CO2, marine 87S r/86S r and marine δ18O).

99


https://doi.org/10.1101/017988

500 450 400 350 300 250 200 150 100 50 0

−4

−3

−2

−1

0

1

2

3

4

Time (Ma)

Re

lative

se

a le

ve

l

average Sea level


Hallam

Haq

Fig S4b3 The Phanerozoic eustatic curve is obtained by averaging the Hallam’s and Haq’s results on sealevel fluctuations.

100


https://doi.org/10.1101/017988

500 450 400 350 300 250 200 150 100 50 0

−4

−3

−2

−1

0

1

2

3

4

Time (Ma)

Re

lative

va

ria

tio

n

average Climato−eustatic curve


Sea level

Climate gradient

Fig S4b4 The climato-eustatic curve is obtained by averaging the Phanerozoic climate gradient curve (FigS4b2) and the Phanerozoic eustatic curve (Fig S4b3).

101


https://doi.org/10.1101/017988

500 450 400 350 300 250 200 150 100 50 0

−4

−3

−2

−1

0

1

2

3

4

Time (Ma)

Re

lative

va

ria

tio

n


Sea level

Climate gradient

Biodiversity

Fig S4b5 Relationships among the eustatic curve (Fig S4b3), the climate gradient curve (Fig S4b2) and thebiodiversity net fluctuation curve (Fig S4a8) in the Phanerozoic eon.

102


https://doi.org/10.1101/017988

500 450 400 350 300 250 200 150 100 50 0

−0.4

−0.2

0

0.2

0.4

Time (Ma)

Va

ria

tio

n r

ate

s


Sea level variation rate

Climate gradient variation rate

Biodiversity variation rate

Fig S4b6 Variation rate of the eustatic curve, variation rate of the climate gradient curve and variation rateof the biodiversity net fluctuation curve in the Phanerozoic eon (the derivative curves of the three curves inFig S4b5).

103


https://doi.org/10.1101/017988

500 450 400 350 300 250 200 150 100 50 0

−0.4

−0.2

0

0.2

0.4

Time (Ma)

Va

ria

tio

n r

ate

s


O−

S

F−

F

P−

Tr

Tr−

J

K−

Pg

G−

L

Climato−eustatic variation rate

origination rate − extinction rate

Biodiversity variation rate

Fig S4b7 Variation rate of the climato-eustatic curve and variation rate of the biodiversity net fluctuationcurve in the Phanerozoic eon (the derivative curves of the two curves in Fig 4B). The five mass extinctions(including two phases in the P-Tr extinction) are marked in the present figure (Fig 4B), and the minorextinctions can also be observed in the variation rate of the climato-eustatic curve. Additionally, the differencebetween the origination rate and the extinction rate is plotted (Fig S4d3).

104


https://doi.org/10.1101/017988

500 450 400 350 300 250 200 150 100 50 0

0

1000

2000

3000

4000

5000

6000

Time (Ma)

Num

ber

of genera


Predicted biodiversity curve

Phanerozoic biodiversity curve

All genera in Sepkoski’s Compendium

Well−resolved genera in Sepkoski’s Compendium

Fig 4C Reconstruction of the Phanerozoic biodiversity curve based on genomic, climatic and eustatic data,whose result agrees with the biodiversity curve based on fossil records. The Bambach et al.’s Phanerozoicbiodiversity curve (Fig S4a8) based on fossil records is in green, and the predicted biodiversity curve in pinkis based on the climato-eustatic curve in Fig S4b4 and the exponential trend of genome size evolution in FigS4a7. Refer to Fig S4c1 for details.

105


https://doi.org/10.1101/017988

500 450 400 350 300 250 200 150 100 50 0

−2

−1

0

1

2

3

4

5

6

7

8

9

10

Time (Ma)

Lo

ga

rith

m o

f n

um

be

r o

f g

en

era


Phaneroazoic biodiversity curve = Fluctuation of biodiversity × Trend of biodiversity

Fluctuation of biodiversity

Trend of biodiversity: Ngenus

(t) = exp( − t ⋅ log2 / 172.7 + 8 )

Predicted biodiversity curve = Predicted fluctuation of biodiversity × Predicted trend of biodiversity

Predicted fluctuation of biodiversity = Climato−eustatic curve × standard deviation of Fluctuation of biodiversity

Predicted trend of biodiversity: Ngenus

predict(t) = exp( − t ⋅ log2 / 177.8 + 8 )

Climate−eustatic curve

500 450 400 350 300 250 200 150 100 50 0

1

10

100

1000

10000

Nu

mb

er

of

ge

ne

raFig S4c1 Detailed explanation of the reconstruction of the Phanerozoic biodiversity curve based on genomic,climatic and eustatic data (Fig 4A, 4B, 4C). The exponential growth trend of the predicted biodiversity curve(obtained based on the exponential trend in genome size evolution in Fig S4a7) corresponds to the exponentialgrowth trend in the Phanerozoic biodiversity curve (obtained based on the fossil records in Fig S4a8); Thefluctuations of the predicted biodiversity curve (obtained based on the climato-eustatic curve in Fig S4b4)corresponds to the biodiversity net fluctuation curve (obtained based on the fossil records in Fig S4a8).

106


https://doi.org/10.1101/017988

500 450 400 350 300 250 200 150 100 50 0

0

0.2

0.4

0.6

0.8

1

Time (Ma)

Origin

ation r

ate


Predicted origination rate

Phanerozoic origination rate

Fig 4D Explanation of the declining origination rate through the Phanerozoic. Refer to Fig S4d3 to seethe declining origination rate and extinction rate based on fossil records; and refer to Fig S4d2 to see theexplanation of these declining rates.

107


https://doi.org/10.1101/017988

500 450 400 350 300 250 200 150 100 50 0

0

100

200

300

Time (Ma)

Orig

ina

tio

n o

f g

en

era


Predicted origination rate × Predicted trend of biodiversity

Constraint of origination ability

Fig S4d1 Constraint of origination ability of life on Earth, which results in the declining origination ratethrough the Phanerozoic eon (Fig 4D).

108


https://doi.org/10.1101/017988

500 450 400 350 300 250 200 150 100 50 0

−0.1

0

0.1

0.2

0.3

0.4

0.5

0.6

Time (Ma)

Pre

dic

ted

va

ria

tio

n r

ate


Predicted origination rate = c’ (Climato−eustatic variation rate + c’’) / Predicted trend of biodiversity

Predicted extinction rate

Predicted origination rate − Predicted extinction rate

Fig S4d2 Explanation of the declining origination rate and extinction rate in the Phanerozoic eon, accordingto the reconstruction of the Phanerozoic biodiversity curve (Fig S4c1). The declining rates are owing to theexponential growth trend driven by the genome size evolution.

109


https://doi.org/10.1101/017988

500 450 400 350 300 250 200 150 100 50 0

−0.2

0

0.2

0.4

0.6

0.8

Time (Ma)

Va

ria

tio

n r

ate


Phanerozoic origination rate

Phanerozoic extinction rate

Phanerozoic origination rate − Phanerozoic extinction rate

Fig S4d3 The declining origination rate and extinction rate in the Phanerozoic eon based on fossil records,which can be explained in Fig 4D and Fig S4d2.

110


https://doi.org/10.1101/017988

Download - Concurrent origins of the genetic code and the …...2015/04/13 · 0 Overview The origins of the genetic code, the homochirality of life and the three domains of life are the critical

Top Related