Concurrent origins of the genetic code and the homochiralityof life, and the origin and evolution of biodiversity.
Part II: Technical appendix
Dirson Jian LiDepartment of Applied Physics, Xi’an Jiaotong University, Xi’an 710049, China
Abstract
Here is Part II of my two-part series paper, which provides technical details and evidence for Part Iof this paper (see “Concurrent origins of the genetic code and the homochirality of life, and the originand evolution of biodiversity. Part I: Observations and explanations” on bioRxiv).
Contents0 Overview 1
1 The roadmap for the evolution of the genetic code and the origin of homochirality of life 31.1 Initiation of the roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.2 Expansion of the genetic code along the roadmap . . . . . . . . . . . . . . . . . . . . . . . 41.3 The ending of the roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71.4 Origin of tRNAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81.5 Codon degeneracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101.6 Origin of homochirality of life . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141.7 Evidence for the roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Origin of the three domains 182.1 Genomic codon distributions reveal the relationship between the evolution of the genetic code
and the tree of life . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182.2 Features of genomic codon distributions: observations and simulations . . . . . . . . . . . . 202.3 The evolution space and the tree of life . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3 Reconstruction of the Phanerozoic biodiversity curve 253.1 Explanation of the trend of the Phanerozoic biodiversity curve based on genome size evolution 253.2 Explanation of the variation of the Phanerozoic biodiversity curve based on climatic and
eustatic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283.3 Reconstruction of the Phanerozoic biodiversity curve and the strategy of the evolution of life 34
References 36
Index 41
Supplementary Figures 43
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
0 Overview
The origins of the genetic code, the homochirality of life and the three domains of life are the critical eventsat the beginning of life. Most direct evidence for these primordial events has been erased; we have to turn tocircumstantial evidence in the genome sequences to reproduce the early history of life. In Part I, a roadmaptheory is proposed to explain the above events. It indicates that the genetic code and the homochiralityof life originated concurrently, which consequently determined the origin and evolution of the biodiversity.Evidence is the final arbiter for this theory. Here in Part II, based on biological and geological data, plenty ofevidence has been provided in detail from different perspectives.
Section 1 explains the roadmap for the evolution of the genetic code and the origin of the homochiralityof life (Fig 1, 2), which includes initiation, expansion and the ending of the roadmap (Fig 1A, 1B, 2B); theorigin of tRNAs (Fig 2A); cooperative recruitments of codons and amino acids (Fig 2B); roadmap symmetryand codon degeneracy (Fig 2C); chiral symmetry broken at the origin of life (Fig 2D); and a summary ofevidences that support the roadmap. Section 2 explains the origin of the three domains of life (Fig 3), whichincludes the relationship between the evolution of the genetic code and the tree of life based on the genomiccodon distributions (Fig 3A, S3a5); the origin order of Virus, Bacteria, Archaea and Eukarya according tothe recruitment order of codons (Fig 3C); construction of the evolution space; and construction of the treeof life based on the roadmap and genomic codon distributions (Fig 3D, S3d9). And Section 3 explainsthe Phanerozoic biodiversity curve qualitatively and quantitatively (Fig 4), which includes the split andreconstruction scheme for the Phanerozoic biodiversity curve (Fig 4A, 4B, 4C); the evolution of genomesize (Fig S4a2, S4a7); the Phanerozoic climatic curve (Fig S4b2); the Phanerozoic eustatic curve (Fig S4b3);tectonic cause of mass extinctions (Fig 4B); and explanation of the declining origination rate and extinctionrate (Fig 4D). Some new nomenclatures are defined when necessary in this paper, which are listed in theIndex.
There are 14 subfigures plus 53 supplementary subfigures to explain the origin and evolution of life basedon biological and geological data. Numbering rules are as follows: Appendix Section 1-3 correspond toSection 1-3 in the report respectively, and the supplementary subfigures Fig Sαβγ (α = 1, 2, 3, 4; β = a, b, c,d; γ = 1, 2, 3, ...) correspond to subfigures Fig αβ (α = 1, 2, 3, 4; β = A, B, C, D; with corresponding ‘αβ’)in the report respectively.
Section 1 (Appendix Section 1) : Fig 1, 2 (Fig S1, S2)
Fig 1A (Fig S1a1) , Fig 1B,
Fig 2A (Fig S2a1-3) , Fig 2B (Fig S2b1-5),
Fig 2C (Fig S2c1-3) , Fig 2D (Fig S2d1);
Section 2 (Appendix Section 2) : Fig 3 (Fig S3)
Fig 3A (Fig S3a1-6) , Fig 3B (Fig S3b1-3),
Fig 3C (Fig S3c1-2) , Fig 3D (Fig S3d1-10);
1
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
Section 3 (Appendix Section 3) : Fig 4 (Fig S4)
Fig 4A (Fig S4a1-8) , Fig 4B (Fig S4b1-7),
Fig 4C (Fig S4c1) , Fig 4D (Fig S4d1-3).
2
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
1 The roadmap for the evolution of the genetic code and the origin ofhomochirality of life
1.1 Initiation of the roadmap
The evolution of the genetic code can be divided into three stages: the initiation stage, the expansion stageand the ending stage.
In the beginning, there was an R (R denotes purine) single-stranded DNA Poly G (Fig 1A, 1B #1). Bycomplementary base pairing, a YR (Y denotes pyrimidine) double-stranded DNA Poly C · Poly G formed.And by base triplex CG ∗ G, a YR ∗ R1 triple-stranded DNA Poly C · Poly G ∗ Poly G formed (Fig 1A, 1B#1). The third R1 strand Poly G separated out of this YR ∗R1 triple-stranded DNA, which then formed a newY1R1 double-stranded DNA Poly C · Poly G. So far, there was only initial codon pair GGG · CCC (Fig 1A,1B #1).
Starting from this initial #1 codon pair and by the base substitutions throughout the four hierarchies, thebase pairs from #2 to #32 recruited so that the whole genetic code table is fulfilled (Fig 1). In this paper, thetechnical process is illustrated in the roadmap for the evolution of the genetic code (or the roadmap for short)(Fig 1A). The relative stabilities of all the 16 possible base triplexes are listed from weak to strong as follows[1] [2]:
(−) GC ∗ A, AT ∗C, AT ∗ A
(+) CG ∗G, T A ∗C, T A ∗ A, T A ∗G, GC ∗C, GC ∗G, AT ∗ T
(++) CG ∗ A, CG ∗ T, GC ∗ T
(3+) AT ∗G
(4+) CG ∗C, T A ∗ T
The relative stability of base triplexes played a crucial role in the base substitutions along the roadmap.
A weaker base triplex tends to be substituted spontaneously by a stronger one, in the course of repeatedcombinations and separations of strands for the triple-stranded DNA. Only three kinds of substitutions ofbase triplexes are practically required in the roadmap: substitution of (+) CG ∗ G by (++) CG ∗ A, with thetransition from G to A in the third R strand; substitution of (+) CG ∗G by (4+) CG ∗C, with the transversionfrom G to C in the third R strand; substitution of (+) GC ∗ C by (++) GC ∗ T , with the transition from C toT in the third R strand (Fig 1A, Sla1). The transition from G to A is of the most common substitution in theroadmap (Fig 1A). All the 8 codon pairs in Route 0 were formed by the transition from G to A. Consideringthe dislocation of bases in Poly C ·Poly G∗Poly G, only one transversion from G to C was enough to blazed anew path other than Route 0 by bringing the starting codon pairs GGC ·GCC, GCG ·CGC and CGG ·CCG forRoute 1, Route 2 and Route 3 at positions #2, #7 and #10 in the roadmap, respectively (Fig 1A). Consideringthe dislocation of bases similarly, only one transition from C to T was enough to blaze a new path so as to
3
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
form the remaining codon pairs by substitutions at positions #6, #19 and #12 in the roadmap, respectively (Fig1A). The length of the genetic code are not greater than 3 bases because such supposed genetic code cannotbe fulfilled by the demand of increasing stability of the base triplexes; namely the unavoidable instabilities ofthe base triplexes prohibited the k-base genetic code (k , 3) in practice.
In the initiation stage of the roadmap, the codon pairs from #1 to #6 were recruited by the roadmap, whichconstituted the initial subset of the genetic code:
#1 GGG ·CCC, #2 GGC ·GCC, #3 GGA ·UCC, #4 GAG ·CUC, #5 GAC ·GUC, #6 GGU ·ACC.
And in this stage were recruited the earliest 9 amino acids in order: 1Gly, 2Ala, 3Glu, 4Asp, 5Val, 6Pro,7S er, 8Leu, 9Thr, all of which belong to Phase I amino acids [3] [4]. Although the initial subset is small, thetwo essential features of the roadmap, pair connection and route duality, had taken shape in this initial stage(Fig 1A, 2B).
Pair connection is an essential feature of the roadmap. The connected codon pairs in the roadmapgenerally encode a common amino acid (Fig 1A, 2C). For instance, the pair connection #1−Gly−#2 indicatesthat both GGG in #1 and GGC in #2 encode the common amino acid Gly. Pair connections reveal the closerelationship between the recruitment of the codons and the recruitment of the amino acids.
Route duality is another essential feature of the roadmap, which shows the relationship of pair connectionsbetween different routes (Fig 1A, 2C). For instance, the route duality
#1 −Gly − #3 ∼ #2 −Gly − #6
indicates that the pair connection #1 −Gly − #3 in Route 0 and the pair connection #2 −Gly − #6 in Route 1are dual, which encode the common amino acid Gly. Route dualities generally exist between Route 0 andRoute 3, or between Route 1 and Route 2 (Fig 2C).
1.2 Expansion of the genetic code along the roadmap
The genetic codes evolved along the four routes Route 0 − 3 respectively; and the 8 codon pairs in eachroute evolved in the order of the four hierarchies Hierarchy 1 − 4 respectively (Fig 1A). The roadmap can bedivided into two groups: the early hierarchies Hierarchy 1 − 2 and the late hierarchies Hierarchy 3 − 4. Itcan also be divided into two groups: the initial route Route 0 (all-purine codons with all-pyrimidine codons)and the expanded routes Route 1 − 3 (Fig 1A, S3d2). These groupings will be used to explain the origin ofthe three domains in Subsection 2.3.
The genetic codes expanded spontaneously from the initial subset in the expansion stage of the roadmap.Each of the 6 codon pairs in the initial subset expanded to three codon pairs, respectively, by route dualities.
4
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
Details are as follows. The codon pair #2 in the initial subset expanded to the three continual codon pairs #7,#8 and #9 by route duality
#2 − Ala − #8 ∼ #7 − Ala − #9.
The codon pair #1 in the initial subset expanded to the three continual codon pairs #10, #11 and #12 by routeduality
#1 − Ala − #11 ∼ #10 − Ala − #12.
The codon pair #3 in the initial subset expanded to the three continual codon pairs #13, #14 and #15 by routeduality
#3 − Ala − #14 ∼ #13 − Ala − #15.
The codon pair #6 in the initial subset expanded to the three continual codon pairs #16, #17 and #18 by routeduality
#6 − Ala − #17 ∼ #16 − Ala − #18.
The codon pair #5 in the initial subset expanded to the three codon pairs #19, #24 and #26 by route duality
#5 − Ala − #26 ∼ #19 − Ala − #24.
The codon pair #4 in the initial subset expanded to the three codon pairs #20, #25 and #27 by route duality
#4 − Ala − #27 ∼ #20 − Ala − #25.
According to the base substitutions in the roadmap, the recruitment order of the codon pairs from #1 to#32 is as follows (Fig 1A):
#1 GGG ·CCC, #2 GGC ·GCC, #3 GGA ·UCC, #4 GAG ·CUC, #5 GAC ·GUC, #6 GGU ·ACC,#7 GCG ·CGC, #8 AGC ·GCU, #9 GCA ·UGC, #10 CGG ·CCG, #11 AGG ·CCU, #12 UGG ·
CCA, #13 CGA · UCG, #14 AGA · UCU, #15 UGA · UCA, #16 ACG · CGU, #17 AGU · ACU,#18 ACA ·UGU, #19 GUG ·CAC, #20 CAG ·CUG, #21 GAU ·AUC, #22 AUG ·CAU, #23 GAA ·
UUC, #24 GUA · UAC, #25 UAG ·CUA, #26 AAC ·GUU, #27 AAG ·CUU, #28 CAA · UUG,#29 AUA · UAU, #30 AAU · AUU, #31 UAA · AUU, #32 AAA · UUU;
and the recruitment order of the amino acids from No.1 to No.20 is as follows (Fig 1A):
No.1 Gly, No.2 Ala, No.3 Glu, No.4 Asp, No.5 Val, No.6 Pro, No.7 S er, No.8 Leu, No.9 Thr,No.10 Arg, No.11 Cys, No.12 Trp, No.13 His, No.14 Gln, No.15 Ile, No.16 Met, No.17 Phe,No.18 Tyr, No.19 Asn, No.20 Lys.
The recruitment order of the codon pairs and the recruitment order of the amino acids are intricatelywell organised and coherent, according to the subtle roadmap (Fig 1A, 2B). In the initiation stage, firstly,
5
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
the amino acid No.1 was recruited with the codon pair #1, remaining a vacant position. Subsequently, No.1and No.2 were recruited with the codon pair #2; No.1 was recruited with the codon pair #3, remaining avacant position; No.3 was recruited with the codon pair #4, remaining a vacant position; No.4 and No.5 wererecruited with the codon pair #5; No.6 filled up the vacant position of #1; No.7 filled up the vacant positionof #3; No.8 filled up the vacant position of #4; No.1 and No.9 were recruited with the codon pair #6 (Fig 2B).Thus the framework of the genetic code had been established at the end of the initiation stage. Hereafter, itwas always proper to recruit two amino acids that had been already recruited or were about to be recruitednext, such that no more positions remained vacant. Details are as follows. No.2 and No.10 amino acids wererecruited with the codon pair #7; and subsequently, No.2 and No.7 were recruited with #8; No.2 and No.11were recruited with #9; No.6 and No.10 were recruited with #10; No.6 and No.10 were recruited with #11;No.6 and No.12 were recruited with #12; No.7 and No.10 were recruited with #13; No.7 and No.10 wererecruited with #14; No.7 and stop were recruited with #15; No.9 and No.10 were recruited with #16; No.7and No.9 were recruited with #17; No.9 and No.11 were recruited with #18; No.5 and No.13 were recruitedwith #19; No.8 and No.14 were recruited with #20; No.4 and No.15 were recruited with #21; No.13 andNo.16 were recruited with #22; No.3 and No.17 were recruited with #23; No.5 and No.18 were recruitedwith #24; No.8 and stop were recruited with #25; No.5 and No.19 were recruited with #26; No.8 and No.20were recruited with #27; No.8 and No.14 were recruited with #28; No.15 and No.18 were recruited with #29;No.15 and No.19 were recruited with #30; No.8 and stop were recruited with #31; No.17 and No.20 wererecruited with #32 (Fig 2B).
Take for example from #1 to #29, the evolution of the genetic code along the roadmap can described indetails as follows (Fig 1A, 1B). Starting from the position #1 (Fig 1B #1), an R single-stranded DNA broughtabout a YR double-stranded DNA; next, the YR double-stranded DNA brought about a YR∗R1 triple-strandedDNA (the number 1 denotes #1, similar below); next, an R1 single-stranded DNA departed from the YR ∗ R1triple-stranded DNA; next, the R1 single-stranded DNA brought about a R1Y1 double-stranded DNA. Thus,the codon pair GGG ·CCC were achieved at #1. At the beginning of #7 (Fig 1B #7), the R1Y1 double-strandedDNA was renamed as Y1R1 double-stranded DNA, where the 180◦ rotation did not change the right-handedhelix; next, the Y1R1 double-stranded DNA brought about a Y1R1 ∗ R7 triple-stranded DNA, through thetransversion from G to C, where the stability (+) of CG ∗G increased to the stability (4+) of CG ∗C; next, anR7 single-stranded DNA departed from the Y1R1∗R7 triple-stranded DNA; next, the R7 single-stranded DNAbrought about a R7Y7 double-stranded DNA. Thus, the codon pair GCG ·CGC were achieved at #7. The caseof #19 is similar to #7 (Fig 1B #19); the codon pair GTG ·CAC were achieved through the transition from C
to T , where the stability (+) of GC∗C increased to the stability (2+) of GC∗T . The case of #24 is also similarto #7 (Fig 1B #24); the codon pair GT A · T AC were achieved through the common transition from G to A,where the stability (+) of CG ∗G increased to the stability (2+) of CG ∗ A. At the position #29 (Fig 1B #29),the codon pair GCG ·CGC in Y24R24 are non-palindromic in consideration that both GCG and CGC do notread the same backwards as forwards. In this case, a reverse operation is necessary so that the obtained codonpair CAT · ATG in y24r24 read reversely the same as the codon pair T AC · GT A in Y24R24. The processfrom y24r24 to R29Y29 is still similar to the case of #7; the codon pair AT A · T AT were achieved through
6
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
the transition from G to A, where the stability (+) of CG ∗G increased to the stability (2+) of CG ∗ A. Otherprocesses in the roadmap are similar to the above example (Fig 1A, 1B). The reverse operation is unnecessaryfor the cases #2, #7, #10, #11, #3, #4, #16, #9, #19, #27, #23, #22, #24 after palindromic codon pairs andthe last one #32 (Fig 1A), whereas the reverse operation is necessary for the remaining cases #5, #6, #8, #12,#13, #14, #15, #17, #18, #20, #21, #25, #26, #28, #29, #30, #31 (Fig 1A).
1.3 The ending of the roadmap
Nowadays the genetic code table had been expanded from the 6 codon pairs in the initial subset to the 6 + 18codon pairs by route duality; the remaining 8 codon pairs were recruited into the genetic code table in theending stage of the roadmap (Fig 1A, 2B). There were 2 codon pairs remained in each of the four routesRoute 0 − 3 respectively. They satisfied pair connections as follows: #23 − Phe − #32, #21 − Ile − #30,#22 − Met/Ile − #29, #28 − Leu − #31 (Fig 2B). Two of them satisfied route duality (Fig 2B):
#21 − Ile − #30 ∼ #22 − Met/Ile − #29.
The last two stop codons appeared in the pair connection #25 − stop − #31 (Fig 1A, 2B). When the last twoamino acids were recruited through the base pairs #26 − Asn − #30 and #27 − Lys − #32, the codon UAG at#25 had to be selected as a stop codon. Due to no participation of corresponding tRNA, the codon UAA at#31 was selected as the last stop codon.
The non-standard codons also satisfy codon pairs and route dualities in the roadmap (Fig 1A). The codonpairs pertaining to non-standard codons are as follows: #11 − Arg (S er, stop) − #14, #4 − Leu (Thr) − #27in Route 0; none in Route 1; #22 − (Met) − #29 in Route 2; #20 − Leu (Thr,Gln) − #25, #12 − (Trp) − #15,#25− stop (Gln)/Leu−#31, #28−Leu (Gln)−#31 in Route 3. Majority of non-standard codons appear in thelast Route 3 (Fig 1A). Route dualities of non-standard codons exist between Route 0 and Route 3 (Fig 1A):
#4 − Leu (Thr) − #27 ∼ #20 − Leu (Thr) − #25
#11 − (stop) − #14 ∼ #12 − Trp/stop − #15,
where the first stop codon UGA at #15 is dual to the non-standard stop codons in Route 0.
The stabilities of base triplexes played crucial roles in the evolution of the genetic code. It had beenemphasised that the roadmap followed the strict rule that the stabilities of base triplexes increase gradually(Fig Sla1). Also note that the roadmap had tried its best to avoid the unstable triplex DNA. Among the 16possible base triplexes, there are three relatively unstable base triplexes: GC ∗ A, AT ∗ C and AT ∗ A. Sothe statistical ratio of instability for the base triplexes is 3/16. However, the ratio of instability for the basetriplexes in the roadmap is much smaller. There are 49 triplex DNAs through #1 to #32 in the roadmap,which involve 3 × 49 = 147 base triplexes (Fig 1A). The relatively unstable base triplexes GC ∗ A andAT ∗ C have not appeared in the roadmap; only the relatively unstable base triplex AT ∗ A has appearedinevitably for 7 times in the reverse operations so as to fulfil all the permutations of 64 codons (Fig 1A). The
7
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
ratio of instability 7/147 in the roadmap is much smaller than the ratio of instability 3/16 by the statisticalrequirement. When the relatively unstable AT ∗ A appears at the positions #15, #17, #21, #25, #29, #30 and#31, both stabilities of the other two base triplexes in the triplex DNA are (4+) (Fig 1A), which compensatesthe instability of the triplex DNA to some extent. The amino acid Ile, whose degeneracy uniquely is three,occupied three positions #21, #29 and #30 among those 7 positions. And the three stop codons occupiedother three neighbour positions #15, #25 and #31 (Fig 1A). The route dualities played significant roles inthe expansion stage (Fig 1A, 2B). The hence remnant codons were chosen as the stop codons. The first stopcodon UGA appeared at the position #15 where the relatively unstable AT ∗ A appeared firstly (Fig 1A),hence it is possible that the instability might be helpful to achieve the function of stop codons. The stopcodon appeared as early as the midway of the evolution of the genetic code (Fig 1A, 2B), which indicatesthat the genetic code had been taken shape around the midway to promote the formation of the primitive life.Not until the fulfilment of the genetic code, did the translation efficiency increase notably by recognising allthe 64 codons.
1.4 Origin of tRNAs
The roadmap illustrates the coevolution of the genetic code and the amino acids, in which tRNAs playedmediate roles. The recruitment and expansion of tRNAs underlay pair connections and route dualities. Thissubsection explains the origin of tRNAs according to the roadmap.
There are 16 branch nodes and 16 leaf nodes in the roadmap (Fig 1A, 2A). The leaf nodes are as follows:#3, #14, #27, #32 in Route 0; #8, #17, #26, #30 in Route 1; #9, #18, #22, #29 in Route 2; #13, #15, #28, #31in Route 3. And the branch nodes are as follows: #1, #4, #11, #23 in Route 0; #2, #5, #6, #21 in Route 1;#7, #16, #19, #24 in Route 2; #10, #12, #20, #25 in Route 3. For each branch node, the third strand of thetriplex nucleic acids is of DNA (Fig 1A). Nevertheless the third strand is an RNA strand (Fig 2A). Namelythe first two DNA strands are combined by a third RNA strand. Such triplex nucleic acids are of two types:the triplex nucleic acid YR ∗ Yt and the triplex nucleic acid YR ∗ Rt (Fig 2A), where the subscript t denotestRNA. After departing from the triplex nucleic acids, the combination of the two complementary single RNAstrands Yt and Rt fold into a cloverleaf-shaped tRNA [5] [6] [7] [8] [9], which acquired a certain anticodoncorresponding to its position in the roadmap (Fig S2a3).
For example, the tRNA t1 can form from the two triplex nucleic acids Y1R1 ∗ Yt1 and Y1R1 ∗ Rt1 in theposition #1(Fig 2A #1). The triple code CCC in the strand Yt1 is palindromic, while the triple code GGG in thestrand Rt1 is palindromic. The two complementary strands Yt1 and Rt1 can combine into a cloverleaf-shapedtRNA t1 by joining together, complementary pairing and folding (Fig S2a3). Thus anticodon arm of t1contains the anticodon CCC, which corresponds to Gly; namely t1 transports Gly. As another example, thecodon in the position #11 is not palindromic, where two tRNAs t7 and t10 are folded respectively (Fig 2A#11). Two complementary single RNA strands Yt11 and Rt11 form by reverse cooperation. They combineinto the tRNA t7 by joining together, complementary pairing and folding, which contains the anticodon GGA
8
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
and carry S er (Fig S2a3). Furthermore, another two complementary single RNA strands yt11 and rt11 canform. They combine into the tRNA t10, which contains the anticodon CCU and carry Arg (Fig S2a3).
There are 4 palindromic codons: #1 CCC ·GGG, #4 CUC ·GAG, #7 CGC ·GCG, #19 CAC ·GUG amongthe 16 branch nodes in the roadmap (Fig S2a3). Accordingly there are 12 non-palindromic codons among thebranch nodes in the positions #2, #5, #6, #10, #11, #12, #16, #20, #21, #23, #24 and #25. Due to the bijectionbetween Route 1 and Route 3 in the sense of reverse relationship (Fig 1A, 2A), the complementary single RNAstrands obtained in the two routes are same to each other (Fig 2A). Thus, there are totally 4+ (12−4)×2 = 20pairs (4 palindromic codons, and the 12 non-palindromic codons minus 4 identities between Route 1 andRoute 3) of complementary single RNA strands, which combine into 20 tRNAs respectively (Fig S2a1, S2a3).
The anticodons of tRNAs such obtained may originate from either the triplet code in Yt or the tripletcode in Rt of the 20 possible pairs of triplet codes in the branch nodes (Fig S2a3). An assignment schemecan be obtained so that 20 anticodons chosen from the 20 pairs of possible triplet codes can be assigned tothe 20 canonical amino acids respectively (Fig S2a2), which is as follows in details (Fig 2A, S2a2, S2a3).Note that these tRNAs are named after the series number of the recruitment order of the amino acids in theroadmap. There are 6 tRNAs (t1:1Gly, t3:3Glu, t7:7S er, t10:10Arg, t17:17Phe, t20:20Lys) from Route 0, 2tRNAs (t4:4Asp, t19:19Asn) from Route 1, 6 tRNAs (t2:2Ala, t9:9Thr, t11:11Cys, t13:13His, t16:16Met,t18:18Tyr) from Route 2, and 6 tRNAs (t5:5Val, t6:6Pro, t8:8Leu, t12:12Trp, t14:14Gln, t15:15Ile) fromRoute 3. The anticodons of t4 and t5 originated respectively from yt5 and Rt20 of the same pairs yt5rt5 orYt20Rt20 (Fig 2A). The tRNA t19 originated from the combination of Yt30 and Rt30, whose anticodon AUU
came from a transition from C to U (Fig 2A, S2a2).
There are 20 canonical amino acids encoded by the degenerate codons. The above assignment scheme hasprovided an argument to account for the number of the canonical amino acids (Fig 2A, S2a2, S2a3). It willbe explained, in Subsection 2.3, that not only the 64 codons result from the 64 permutations of the 4 bases,but also the 20 canonical amino acids relate to the 20 combinations of the 4 bases (Fig S3d1). The genomiccodon distributions can be divided into 20 classes according to the 20 combinations of the 4 bases (Fig S3d1).Given the base compositions p(i), i = G, C, A, T , the productsp(i)∗ p( j)∗ p(k), i, j, k = G, C, A, T are equalfor each of the 20 combinations. The 20 combinations of the 4 bases can be divided into 4 groups < G >,< C >, < A >, < T > (Fig S3d1). Hierarchy 1 and Hierarchy 2 corresponds < G > and < C >; Hierarchy 3and Hierarchy 4 corresponds to < A > and < T >. Their positions in the roadmap are
Hierarchy 1 − 2 Y : < G > , Hierarchy 1 − 2 R : < C >,
Hierarchy 3 − 4 Y : < A > , Hierarchy 3 − 4 R : < T > .
Route 0 and Route 1 − 3 are distinguished in each of the four groups. Take < G > for example, < G, G, G >
an < G, G, A > belong to Route 0, while < G, G, C >, < G, G, T > and < G, C, A > belong to Route 1−3.The 20 tRNAs transfer 20 amino acids, the number of which is equal to the number of the 20 classes ofgenomic codon distributions. This is helpful to improve the translation efficiency, because it is easier for atRNA to recognise a certain class of genomic codon distributions with the same codon interval distance.
9
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
The wobble pairing rules can be explained by the theory on the origin of tRNAs. The transition from C
to T occurred in the position #6, which resulted in the wobble pairing rule G : U or C (Fig 2A). Taking y2r2as a template, yt2 with GCC is formed by the triplex base paring, while rt2 with GGC and r′t 2 with GGU areformed, where the transition from C to U occurred in the formation of r′t 2. The complementary strands yt2and r′t 2 combine into a tRNA with anticodon GCC, where G at the first position of the anticodon of the tRNAis paired with U at the third position of the triple code of an additional single strand r′t 2. It implies that thewobble pairing rule G : U had been established as early as the end of the initiation stage of the roadmap. Thetransition from C to T occurred in the position #12, which resulted in the wobble pairing rule U : G or A (Fig2A). Taking y10r10 as a template, yt10 with CCG is formed by the triplex base paring, and rt10 with CGG
and r′t 10 with UGG are also formed, where the transition from C to U occurred in the formation of r′t 10.The complementary strands yt10 and r′t 10 combine into a tRNA with anticodon UGG, where U at the firstposition of the anticodon of the tRNA is paired with G at the third position of the triple code of an additionalsingle strand yt10. The above explanation of the wobble pairing rules by tRNA mutations is supported bythe observations of nonsense suppressor. For instance, the wobble pairing rule C : A for a UGA suppressorcan be established by a transition from G to A at the 24 − th position of tRNATrp. The wobble pairing rulesG : U or C and U : G or A had been established early in the evolution of the genetic code (Fig 2A), whichcontinued to flourish so as to make full use of the tRNAs in short supply.
The 20 anticodons in the tRNAs from t1 to t20 correspond to the 20 amino acids, respectively. Hence,the corresponding 20 codons can be regarded as the minimum subset of the genetic code that encodes allthe amino acids (Fig S2a2). The minimum subset consists of GGG, GCG, GAG, GAC, GUG, CCG, UCC,CUA, ACG, AGG, UGC, UGG, CAC, CAG, AUC, AUG, UUC, UAC, AAU, AAG. For each amino acid, itscorresponding codon in the minimum subset generally recruited earlier than the other corresponding codonsout of the minimum subset (Fig S2c2). The codons in the minimum subset whose first base are purinesgenerally correspond to the anticodons that come from the strands Yt, except for GUC and AUC (Fig S2a2);those whose first base are pyrimidines generally correspond to anticodons that come from the strands Rt,except for CAG and UGG (Fig S2a2). The codons in the minimum subset are generally situated in the branchnodes, except for AAG, UCC, AAU, AUG and UGC (Fig 2A, S2a2). The numbers of codons in the minimumsubset in Route 0−3 are 6, 4, 6, 4 respectively (Fig 2A, S2a2). The minimum subset can be extended throughwobble pairings. Additional tRNAs can be combined according to the roadmap so as to recognise all thecodons.
1.5 Codon degeneracy
The four routes of the roadmap (Fig 1A) can be represented by four cubes (Fig 2C). The vertices of the cubesrepresent codon pairs, while the edges of the cubes represent substitution relationships (Fig 2C). Cubes areconvenient to demonstrate pair connections and route dualities, by which it is more intuitive to study thecodon degeneracy. The degeneracies 6, 4, 3, 2 or 1 for the 20 amino acids can be explained one by oneaccording to pair connections and route dualities in the roadmap (Fig 1A, 2C). The degeneracy 2 mainly
10
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
results from pair connections. The degeneracy 4 or 6 mainly result from the expansion of the genetic codefrom the initial subset by route dualities pertaining to S er, Leu, Ala, Val, Pro and Thr (Fig 2B, 2C).
The degeneracy 6 for S er, Leu and Arg can be explained by pair connections and route dualities (Fig 1A,2C), where S er and Leu belong to the initial subset and Arg was recruited immediately after the initial subset.And all of them have appeared in Route 0. The 6 codons of S er satisfy both the pair connection #8−S er−#17and the route duality
#3 − S er − #14 ∼ #13 − S er − #15.
The 6 codons of Leu satisfy both the pair connection #28 − Leu − #31 and the route duality
#4 − Leu − #27 ∼ #20 − Leu − #25.
The 6 codons of Arg satisfy both the pair connection #7 − Arg − #16 and the route duality
#11 − Arg − #14 ∼ #10 − Arg − #13.
The degeneracy 4 for Gly, Ala, Val, Pro and Thr can be explained by route dualities (Fig 1A, 2C). All ofthem belong to the initial subset. The degeneracy 4 for Gly satisfy the route duality:
#1 −Gly − #3 ∼ #2 −Gly − #6.
The degeneracy 4 for Ala satisfy the route duality:
#2 − Ala − #8 ∼ #7 − Ala − #9.
The degeneracy 4 for Val satisfy the route duality:
#5 − Val − #26 ∼ #19 − Val − #24.
The degeneracy 4 for Pro satisfy the route duality:
#1 − Pro − #11 ∼ #10 − Pro − #12.
The degeneracy 4 for Thr satisfy the route duality:
#6 − Thr − #17 ∼ #16 − Thr − #18.
The degeneracy 2 for Glu, Asp, Cys, His, Gln, Phe, Tyr, Asn and Lys can be explained by pair connections(Fig 1A, 2C). They satisfy the following pair connections respectively: #4 − Glu − #23, #5 − Asp − #21,#9 − Cys − #18, #19 − His − #22, #20 − Gln − #28, #23 − Phe − #32, #24 − Tyr − #29, #26 − Asn − #30,#27 − Lys − #32. The degeneracy 3 for Ile and the degeneracy 1 for Met satisfies the route duality (Fig 1A,2C)
#21 − Ile − #30 ∼ #22 − Met/Ile − #29.
11
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
The degeneracy 1 for Trp satisfies the pair connection for nonstandard genetic code #12 − Trp/stop(Trp) −#15. This pair connection includes a stop codon; the other stop codons satisfy the pair connection: #25 −stop − #31 (Fig 1A, 2C).
It is convenient to illustrate the symmetry of the roadmap by the cubes (Fig 2C). Some correspondencesof codon pairs among routes are as follows (Fig 2C):
Route 0 : UVW · wvu ↔ Route 1 : UVw ·Wvu
Route 0 : WUV · vuw ↔ Route 3 : wUV · vuW
Route 1 : WUV · vuw ↔ Route 2 : vuW · wUV
Route 2 : UVW · wvu ↔ Route 3 : VUW · wuv.
The feature of pair connections in the roadmap can be explained by the wobble pairings. The pair connectionsin the routes: Route 0 R strand, Route 2 R strand and Route 3 can be explained by the wobble pairing ruleU : G, A (Fig 2C); The pair connections in the routes: Route 0 Y strand, Route 2 Y strand and Route 1can be explained by the wobble pairing rule G : U,C (Fig 2C). By and large, there is a symmetry betweenRoute 0, 1 and Route 3, 2 in the roadmap (Fig 1A, 2C). Considering the expansion of the genetic code fromthe initial subset, the pair connections pertaining to Pro, S er and Leu in Route 0 Y strand are dual to the pairconnections in Route 3 Y strand, respectively (Fig 2B, 2C); the pair connections pertaining to Ala, Thr andVal in Route 1 Y strand are dual to the pair connections in Route 2 R strand, respectively (Fig 2B, 2C). Thecodon pairs can be divided into groups by the 6 facets of the cubes for the routes respectively: Facet u andFacet d by up and down; Facet w and Facet e by west (left) and east (right); Facet n and Facet s by north(back) and south (front) (Fig 2C). According to the route dualities, Route 0 and Facet u are dual to Route 3and Facet s; Route 0 and Facet d are dual to Route 3 and Facet n; Route 1 and Facet s are dual to Route 2and Facet u; Route 1 and Facet n are dual to Route 2 and Facet d (Fig 2C).
The roadmap not only explain the codon degeneracy for the 20 amino acids, but also discriminate the 5biosynthetic families of the amino acids. A GCAU genetic code table can be obtained by changing the orderU, C, A, G in the standard genetic code table to a new order G, C, A, U, where the third base are sorted by theorder G, A, C, U in considering the wobble pairings (Fig S2c2). The biosynthetic families of Glu, Asp, Val,S er and Phe situate in the continuous regions in the GCAU genetic code table, respectively (Fig S2c2). Thecodons as well as amino acids, in the recruitment orders, expand roughly from up to down, from left to rightin the GCAU genetic code table (Fig S2c2). So the evolution of the genetic code can be illustrated better inthe GCAU genetic code table than in the standard genetic code table. The clusters of the biosynthetic familiescan be explained by the roadmap. The codons are classified by the R strands and Y strands of the four routesas follows:
Route 0 R : RRR , Route 0 Y : YYY,
Route 1 R : RRY , Route 1 Y : RYY,
Route 2 R : RYR , Route 2 Y : YRY,
Route 3 R : YRR , Route 3 Y : YYR.
12
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
In such a classification, the GCAU genetic code table can be divided into 8 × 4 blocks (Fig S2c1), whichconsequently influences the codon degeneracy and the clustering of biosynthetic families (Fig S2c2). Theregion of Route 0 R GRR and Route 1 GNY (N denotes purine or pyrimidine) relate to the biosyntheticrelationships among Gly, Ala, Glu, Asp, Val (Fig S2c1, S2c2). The region of Route 3 CNR explains thealignment of Arg, Pro, Gln that belong to the biosynthetic family of Glu (Fig S2c1, S2c2). The region ofRoute 1 ANY explains the alignment of Thr, Asn, Ile that belong to the biosynthetic family of Asp (Fig S2c1,S2c2). The region of Route 1 RNY relates to the biosynthetic relationships among Asp and Thr, Asn (FigS2c1, S2c2). The region of Route 3 R URR explains the correlations of positions among the 3 stop codons(Fig S2c1, S2c2). Moreover, the region of Route 0 R RRR can explain the biosynthetic relationships fromGlu to Lys; and ARG was assigned to Arg of the biosynthetic family of Glu (Fig S2c1, S2c2). The regionof Route 2 R GYR explains the biosynthetic relationships from Ala to Val (Fig S2c1, S2c2). The region ofRoute 3 Y YUR explains the degeneracy of Leu (Fig S2c1, S2c2).
The roadmap provides a framework beyond the explanation of the evolution of the genetic code. Accordingto the roadmap, the genetic code, tRNA, amino acid and aminoacyl-tRNA synthetase evolved together.Except the certain three positions of the codon, other positions also evolve along the roadmap (Fig S3a1).The evolution of the triplex nucleic acids along the roadmap might generate the corresponding sequenceencoding the aminoacyl-tRNA synthetase. The classification of the aminoacyl-tRNA synthetases distributedregularly in the roadmap. The aminoacyl-tRNA synthetases are divided into two classes Class II and Class I,and accordingly subclasses IIA, IIB, IIC; IA, IB, IC [10]. According to the recruitment order of theamino acids, the Class II aminoacyl-tRNA synthetases originated earlier than the Class I aminoacyl-tRNAsynthetases (Fig 2A, 2C). In the initial subset, #1, #2, #3, #5 Y , #6 are Class II aminoacyl-tRNA synthetases;Gly − tRNA appeared at #1 as the first Class II aminoacyl-tRNA synthetase (Fig 2C). #4, #5 R are Class I
aminoacyl-tRNA synthetases; Glu−tRNA appeared at #4 as the first Class I aminoacyl-tRNA synthetase (Fig2C). Majority of the Route 0, 1 are Class II aminoacyl-tRNA synthetases; majority of the Route 2, 3 areClass I aminoacyl-tRNA synthetases (Fig 2C). The exceptions mostly concern with the route duality betweenRoute 0 and Route 3 pertaining to S er, Leu, Pro, and route duality between Route 1 and Route 2 pertainingto Ala, Val, Thr (Fig 2C). The early Class II aminoacyl-tRNA synthetases recognize the variable arm of thetRNA. According to the explanation of the origin of the tRNA, the variable arm is complementary with theanticodon arm (Fig S2a3). Paracodons may be around the variable arm. The anticodons of the degeneracies4 or 6 are GGN, GCN, GUN, CGN, CCN, CUN, ACN and UCN. The aminoacyl-tRNA synthetases inthese codon boxes are IIA or IA (Fig S2c3). The aminoacyl-tRNA synthetases with anticodons of the smalldegeneracies are IIB, IIC, IB, IC (Fig S2c3). The aminoacyl-tRNA synthetases can also be divided intomajor groove (M) and minor groove (m) [11]. They distribute regularly in the roadmap (Fig 2C):
Route 0 Facet u : M , Route 0 Facet d : m;
Route 1 Facet s : M , Route 1 Facet n(Y/R) : M/m;
Route 2 Facet u(Y/R) : M/m , Route 2 Facet d(Y/R) : m/M;
Route 3 Facet s(Y/R) : m/M , Route 3 Facet n : m.
13
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
The aminoacyl-tRNA synthetasis and the tRNA determines the pair connections and route dualities in theroadmap, which resulted in the codon degeneracy for amino acids and the 5 biosynthetic families of aminoacids (Fig 2C).
1.6 Origin of homochirality of life
Homochirality is an essential feature of life. In Part I of this paper, the origin of the homochirality of life isexplained by chiral symmetry breaking between the left-handed roadmap and the right-handed roadmap (Fig2D), which helps to explain the origin of the living system from the non-living system. This is analogous to(but not trivially related to) the spontaneous symmetry breaking in particle physics [12] [13], which helps toexplain the generation of masses from massless fields. The time for the existence of life on the Earth is in thesame order of magnitude of the age of the universe, which implies that life is an intrinsic phenomenon in theuniverse that originates spontaneously at the macromolecular level. Homochirality safeguards the stability ofthe living system, which has ensured the life to last comparable with the age of the universe.
The first recruited amino acid Gly is not chiral (Fig 1A #1). The following 19 amino acids are chiral (Fig1A #2). In the initiation stage of the roadmap, the nonchiral Gly helped to create the first pair connection#1 −Gly − #2, recruiting chiral Ala at #2 (Fig 1A). And the nonchiral Gly also helped to create the first routeduality in the roadmap (Fig 1A):
#1 −Gly − #3 ∼ #2 −Gly − #6.
This route duality played a central role in the initiation stage; consequently the initial subset played a centralrole in the expansion stage (Fig 2B). The chirality was required at the beginning of the roadmap by thetriplex DNA itself (Fig 1A, 1B). Even so, there was still a transition period from nonchirality to chirality, inconsideration of the special role of nonchiral Gly.
The complex evolution from Hierarchy 1 to Hierarchy 4 was a recurrent and cyclic process, with theintermittent chemical reaction conditions in environment. There were two possible chiral roadmaps in theevolution of the genetic code as follows (Fig 2D):
Right-handed chiral roadmap
from #1 to #32, right-handed triplex DNA, D-ribose, L-amino acids;
Left-handed chiral roadmap
from 1# to 32#, left-handed triplex DNA, L-ribose, D-amino acids.
The four bases themselves are not chiral, which are shared in common by the two chiral roadmaps. The twochiralities were selected randomly in the winner-take-all competition. The translation efficiency improvedgreatly as soon as all the 64 codons were recruited into the roadmap. First completed chiral roadmapdominated and was able to decompose the uncompleted opposite chiral system. Hence, there is no longerthe chance for the existence of the opposite chirality.
14
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
The origin of the homochirality of life can be simulated based on the roadmap. The competition betweenthe right-handed chiral roadmap and the left-handed chiral roadmap can be simulated by a stochastic model.The selection of chirality of life was absolutely by chance with equal probability. The chiral roadmap evolvesfrom #1 to #32 or from 1# to 32# by a constance substitution probability step by step (Fig S2d1). Thestruggle for shared resources, such as the bases, might existed between the two chiral roadmaps (Fig S2d1).The model can estimate how many steps the winner pulled away from the loser in the competition of thetwo chiral roadmaps. The results of the estimations are as follows: around 3 steps gap in the case withoutstruggle for bases (Fig S2d1), and around 7 steps gap in the case with struggle for bases (Fig S2d1). In eithercase, the chiral roadmap of the loser had already finished the initiation stage and expansion stage according tothe simulations. Namely, it was in the ending stage that the two chiral roadmaps were separated (Fig S2d1).The recruitment of the last two stop codons in the ending stage may promote the separation of the two chiralroadmaps (Fig 1A, S2d1).
It is a great expectation to achieve the homochirality of life based on the roadmap for the experimentalists.Artificial conditions are more operational and efficient than the environmental changes. Proper proceduresand schedules are expected to considerably shorten the time to achieve the homochirality of life in laboratory.This is indeed a challenge.
1.7 Evidence for the roadmap
So far, a detailed description of the evolution of the genetic code has been proposed according to the roadmaptheory. In this subsection, numerous theoretical and experimental evidences are listed concretely that will beexplained here or later on.
The key experimental evidences in supporting of the roadmap are as follows. The roadmap itself is basedon the experimental results on the relative stabilities of the base triplexes. The delicate roadmap has narrowlyavoided the unstable base triplexes. The layout of the roadmap is, therefore, determined by the increasingstabilities of the base triplex along the roadmap (Fig Sla1). Genomic data provide the most successful andmost straightforward experimental evidence for the roadmap, which will be explained in detail in Section2. An evolutionary tree of codons is obtained based on the genomic codon distributions, where the fourhierarchies of the roadmap are distinguished clearly (Fig 3A). This is a straightforward evidence to supportthe roadmap theory. Coincidentally, an evolutionary tree of species can also be obtained based on the samegenomic codon distributions and the roadmap (Fig S3d8, S3d9), which agrees with the tree of three domains.The successful explanation of the origin of the three domains also conversely supports the roadmap theory.Another strong and heuristic evidence for the roadmap is that the origin of the homochirality of life canbe explained by the roadmap straightforwardly (Fig 2D). If the homochirality of life can be achieved bythe roadmap in experiment, the roadmap will be conclusively confirmed. The origin of the genetic code,the origin of the homochirality of life and the origin of the three domains can be explained together in theroadmap picture (Fig 1, 2, 3). These three theories support one another.
15
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
Furthermore, the roadmap is supported by the following evidences. The roadmap predict the recruitmentorder of the 64 codons, the recruitment order of the 3 stop codons and the recruitment order of the 20 aminoacids. Evaluation of the three recruitment orders may verify the roadmap theory.
The recruitment order of the 32 codon pairs can be obtained from #1 to #32 by the roadmap (Fig 2B,S2b1). The declining GC content indicates the evolution direction because of the substitutions from G to A,from G to C and from C to T in the roadmap (Fig S2b1). According to the roadmap, the total GC content andthe position specific GC content for the 1st, 2nd and 3rd codon positions are calculated for each step from #1to #32. Then the relationship between the total GC content and the position specific GC content is obtained(Fig S2b2). The 1st position GC content is higher than the 2nd position GC content. And the 3rd position GC
content declined rapidly from the highest to the lowest in the evolution direction when the total GC contentdeclines. The GC content variation in the simulation agrees with the observation (to compare Fig S2b2 withFigure 2 in [14] and Figure 5 in [15]), so the recruitment order obtained by the roadmap is reasonable. Therecruitment order of the codons by the roadmap generally agree with the order in [4] [16] [17]. It shouldbe noted that the roadmap itself was conceived firstly by studying the substitution relationships based on therecruitment order in [4], which was furthermore improved in accordance with the recruitment order of aminoacids.
The recruitment order for the three stop codons are #15 UGA, #25 UAG, #31 UAA (Fig 1A, 2B), whichresults in the different variations of the stop codon usages. Along the evolution direction as the declining GC
content, the usage of the first stop codon UGA decreases; the usage of the second stop codon UAG remainsalmost constantly; the usage of the third stop codon UAA increases (Figure 1 in [18]). The observations ofthe variations of stop codon usages can be simulated (to compare Fig S2b3 with Figure 1 in [18]) accordingto the recruitment order of the codon pairs (Fig 2B, S2b1) and the variation range of the stop codon usages.Especially, the detailed features in observation can be simulated that the usage of UGA jump downwardsgreatly; UAA, upwards greatly, around half GC content (to compare Fig S2b3 with Figure 1 in [18]).
The recruitment order of the 20 amino acids from No.1 to No.20 can be obtained by the roadmap (Fig 2B,S2b1), which meets the basic requirement that Phase I amino acids appeared earlier than the Phase II aminoacids [19] [20]. The species with complete genome sequences are sorted by the order R10/10 according to theiramino acid frequencies, where the order R10/10 is defined as the ratio of the average amino acid frequenciesfor the last 10 amino acids to that for the first 10 amino acids [21]. Along the evolutionary direction indicatedby the increasing R10/10, the amino acid frequencies vary in different monotonous manners for the 20 aminoacids respectively (Fig S2b4). For the early amino acids Gly, Ala, Asp, Val, Pro, the amino acid frequenciestend to decrease greatly, except for Glu to increase slightly (Fig S2b4); for the midterm amino acids S er, Leu,Thr, Cys, Trp, His, Gln, the amino acid frequencies tend to vary slightly, except for Arg to decrease greatly(Fig S2b4); for the late amino acids Ile, Phe, Tyr, Asn, Lys, the amino acid frequencies tend to increasegreatly, except for Met to increase slightly (Fig S2b4). In the recruitment order from No.1 to No.20, thevariation trends of the amino acid frequencies increase in general (Fig S2b5); namely, the later the amino acidsrecruited, the more greatly the amino acid frequencies tend to increase (Fig S2b4, S2b5). The recruitment
16
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
order of the amino acids from No.1 to No.20 is supported not only by the previous roadmap theory but alsoby this pattern of amino acid frequencies based on genomic data, so it help to resolve the disputes on such anissue in the literatures [22].
17
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
2 Origin of the three domains
2.1 Genomic codon distributions reveal the relationship between the evolution of thegenetic code and the tree of life
In the previous section, the roadmap theory was proposed, as a hypothesis, to explain the evolution of thegenetic code and the origin of the homochirality of life (Fig 1, 2). In this section, more experimental evidencesare provided to enhance and reinforce the roadmap theory (Fig 3). Especially, the four hierarchies in theroadmap are directly distinguished in a tree of codons obtained by studying the genomic codon distributions(Fig 3A). And the origin of the three domains of life can be explained based on the roadmap and the genomicdata (Fig 3D). In short, the biodiversity originated in the evolution of the genetic code, and has flourished bythe environmental changes.
The viewpoint that the tree of life was rooted in the recruitment of codons can be proved by analysingthe complete genome sequences. The genomic codon distribution is a powerful method to study the statisticalproperties of the complete genome sequences, which can reveal the profound relationship between the primordialgenetic code evolution and the contemporary tree of life (Fig 3A, S3a5). The genomic codon distribution fora species with complete genome sequence is defined as follows (Fig S3a3): first, mark all the positions foreach of the 64 codons in its complete genome sequence respectively; second, calculate all the codon intervalsbetween the adjacent marked positions for each of the 64 codons respectively, and obtain the distributions ofcodon intervals according to the 64 sets of codon intervals respectively (some codon intervals count more, andthe others less), where a cutoff 1000 is used when counting the number of common intervals; third, obtain thegenomic codon distribution by normalising the 64 distributions of codon intervals. Normalisation eliminatesthe direct impact of a wide range variation of genome sizes.
Both the tree of codons (Fig 3A) and the tree of species (Fig S3a5) are obtained according to the same setof correlation coefficients among the genomic codon distributions. 952 complete genomes in GeneBank (upto 2009) were used in the calculations. There are 64 genomic codon distributions for each of the 952 species.On one hand, the average correlation coefficient matrix for codons is obtained by averaging these 64 × 64correlation coefficient matrices for the 952 species; hence the tree of codons is obtained based on the 64 × 64distance matrix that equals to half of one minus the average correlation coefficient matrix for codons (Fig3A); on the other hand, the average correlation coefficient matrix for species is obtained by averaging these952 × 952 correlation coefficient matrices for the 64 codons, hence the tree of species is obtained based onthe 952 × 952 distance matrix that equals to half of one minus the average correlation coefficient matrix forspecies (Fig S3a5). The codons from the four hierarchies Hierarchy 1−4 in the roadmap gather together in thetree of codons respectively, where the Hierarchy 1 and Hierarchy 2 situated in one branch, and Hierarchy 3and Hierarchy 4 in another branch (Fig 3A). This is a direct evidence to support the roadmap (Fig 1A, 3A).According to the NCBI taxonomy, the tree of species roughly presents the phylogeny of species (Fig S3a5).Especially, the tree of the eukaryotes with complete genomes is reasonable to some extent (Fig S3a6). In
18
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
Subsection 2.3, an improved tree of species (Fig S3a8, S3a10) will be obtained by dimension reduction forthe 64 × 952 genomic codon distributions according to the roadmap (Fig S3d2, 3D).
The validity of the method of genomic codon distribution can be explained in the roadmap picture.Whereas the more concerns were given to the three-base-length segments of the triplex DNA in the previoussection, base substitutions can occur in any places through the whole length of the triplex DNA (Fig S3a1).Starting from the initial Poly C · poly G ∗ poly G, several codons in different regions of the triplex DNA canevolve independently along the roadmap (Fig S3a1). In this way, the triplex DNA evolves gradually fromthe initial identical sequences to the final non-identical sequences (Fig S3a1). The identical sequences tendto form a triplex DNA, but it becomes more and more difficult for the non-identical sequences to form atriplex DNA. So there was a transition from triplex DNA to double DNA after the genetic code evolution.The evolution of the whole sequence may result in varieties of distributions of codons in the genomes (e.g.,T AA in the last strand in Fig S3a1). The genomic codon distributions, therefore, resulted from the evolutionof the genome sequences. Thus, the biodiversity at the species level actually originated at the sequencelevel. The genomic codon distributions keep the essential features of species at the sequence level. Thegenomic codon distributions were reluctant to change, in the context of normalisation, during the evolutionof genome sizes mainly promoted by large scale duplications (Fig S3a4). In the long history of evolution,even if genome sizes have been doubled for many times, the essential features in the ancestors’ sequencescan be inherited generation by generation due to the relative invariant of the genomic codon distributions. Asa result, the ancient information on the evolution of the genetic code and the origin of the three domains canbe dug out from the complete genome sequences of contemporary species. It should be emphasised that thesequences evolved based on the roadmap were by no mean random, which may account for the linguistics inthe biological sequences.
According to the roadmap, the initial subset took shape in the initiation stage of the evolution of thegenetic code (Fig 1A), where the GC content of the initial subset was 13/18; leaf nodes appeared in theending stage (Fig 1A), where the GC content of the leaf nodes decreased to 16/48 (Fig S3a1). The variationrange of GC contents from 16/48 to 13/18 obtained by the roadmap generally agrees with the observationsof GC contents for contemporary species (Fig S3a1, S3a2). Along with the evolution of the whole sequences(Fig S3a1), the shorter triplex DNAs only accommodate a proper subset of the genetic code; the longertriplex DNAs accommodate almost all the codons. Such a difference will be considered in Subsection 2.2to explain the different features of genomic codon distributions between Eukaryotes and Bacteria etc. Notethat Eukaryotic genomes mainly consists of non-coding DNA, the tree of eukaryotes based on their genomiccodon distributions is, therefore, mainly depended on the non-coding DNA (Fig S3a6, S3d10). This indicatesthat the non-coding DNA has played a substantial role in the evolution of eukaryotes.
19
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
2.2 Features of genomic codon distributions: observations and simulations
The features of genomic codon distributions are different among Virus, Bacteria, Archaea and Eukarya (Fig3B, S3b2), which are also called the (3 + 1, including Virus) domains for convenience. The genomic codondistribution for a domain is obtained by averaging the genomic codon distributions of all the species inthis domain (only considering those with complete genome sequences, in practice) (Fig 3B). The featureof genomic codon distribution for a species is similar to that of the corresponding domain (Fig 3B, S3b2).The method based on the genomic codon distribution apply not only for cellular life but also for viruses.Explanation of the features of the genomic codon distributions (Fig 3B, S3b3) is helpful to understand theorigin of the three domains and to clarify the status of virus.
The features of genomic codon distributions for the domains can be described respectively from twoaspects: first, the periodicity and the amplitudes of the genomic codon distributions; second, the order ofthe average distribution heights for 64 codons. (i) The features of genomic codon distribution for Virusinclude declining three-base-periodic fluctuations, high amplitudes (Fig 3B, S3b2). In addition, the orderof the average distribution heights is: Hierarchy 1, highest; followed by Hierarchy 2 and Hierarchy 3;and Hierarchy 4, lowest (Fig S3c1). (ii) The features of genomic codon distribution for Bacteria includedeclining three-base-periodic fluctuations, medium amplitudes (Fig 3B, S3b2). In addition, the order of theaverage distribution heights is: Hierarchy 1, highest; followed by Hierarchy 2, Hierarchy 3 and Hierarchy 4(Fig S3c1). (iii) The features of genomic codon distribution for Archaea include declining three-base-periodicfluctuations, medium amplitudes (Fig 3B, S3b2). In addition, the order of the average distribution heights isroughly opposite to that of Bacteria: Hierarchy 4, highest; followed by Hierarchy 3 and Hierarchy 2; andHierarchy 1, lowest (Fig S3c1). Comparing Archaea with Bacteria, the dwarf 2nd peak at 6-base intervalin the genomic codon distribution is even lower than the 3rd peak at 9-base interval (Fig 3B, S3b2). Thisdetailed feature is different from Bacteria. For Bacteria, the 2nd peak is higher than the 3rd peak for codonsin Hierarchy 1 and the heights are almost the same for the 2nd and 3rd peaks for other codons (Fig 3B, S3b2).(iv) The features of genomic codon distribution for Eukarya are more complicated than for other domains,which include declining distributions, indistinct multi-base-periodic fluctuations, ups and downs of peaks,or overlapping peaks for some codons (Fig 3B, S3b2). In addition, the order of the average distributionheights is similar to that of Archaea: Hierarchy 4, highest; followed by Hierarchy 3 and Hierarchy 2;and Hierarchy 1, lowest (Fig S3c1). The features of genomic codon distribution for Archaea put togetherthe features of both Archaea and Eukarya (Fig 3B, S3b2). Crenarchaeota and Euryarchaeota are similar ingenome size distributions (Fig 3B, S3b1). It is showed that the tree of the three domains is more reasonablethan the eocyte tree of life, according to the features of genomic codon distributions (Fig 3B, S3b1).
The origin order for the domains can be obtained according to the linear regression plot whose abscissaand ordinate represent the recruitment order of codon pairs and the order of average distribution heightsrespectively (Fig 3C, S3c1). The greater the slope of the regression line for a domain is, the more the laterecruited codons are. So, the order of the slopes of the regression lines represents the origin order of the
20
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
domains: 1 Virus, 2 Bacteria, 3 Archaea, 4 Eukarya (Fig 3C). The slope for Virus is negative and least (Fig3C), which means that majority of codons in the sequences recruited early. The virus originated earlier thanthe three domains. The slope for Bacteria is negative (Fig 3C). The proportion for Hierarchy 1 is high; thatfor Hierarchy 2 − 4, lower. The slope for Archaea is positive (Fig 3C), which means that majority of codonsin the sequences recruited late. The slope for Eukarya is positive and greatest (Fig 3C), which also means thatmajority of codons in the sequences recruited late. The Eukarya originated latest. Archaea originated betweenBacteria and Eukarya (Fig 3C). Limited data indicate that the slope for Crenarchaeota is less than the slopefor Euryarchaeota (Fig 3C), which suggests that Crenarchaeota may originate early than Euryarchaeota. Thisorigin order of the domains is independent from the base compositions for the domains, which are differentfrom the order of the domains by GC content (Fig S3a2).
The features of the genomic codon distributions for the domains can be simulated by a stochastic model.The features of declining three-base-periodic fluctuations can be simulated by the model (Fig 3B, S3b3).In the model, the bases G, C, A, T are given in certain base compositions, from which three bases arechosen statistically to make up a codon. Many a codon concatenates together to form a genome sequencein simulation. Hence, the genomic codon distribution is obtained by calculating the simulated genomesequence. When the length of the triplex DNA increased in the evolution along the roadmap (Fig S3a1),more and more codons are accommodated into the triplex DNA. The feature of declining three-base-periodicfluctuations, therefore, is achieved in the simulations. Let n0 denote the number of codons accommodated inthe triplex DNA. n0 codons are generated statistically to concatenate the simulated genome in several cycles.The genomic codon distribution thus obtained has the feature of declining three-base-periodic fluctuations.When n0 is small (e.g., around 16), the amplitude of the three-base-periodic fluctuations is great. Thegreater n0 is, the less the amplitude of the three-base-periodic fluctuations is. When n0 is 32 (namely allcodons participated in the concatenation), the three-base-periodic fluctuations vanish. The amplitude of thethree-base-periodic fluctuations decreases from Virus to Bacteria, then Archaea, finally Eukarya (Fig 3B,S3b3), which results from the increasing number of codons accommodated in the triplex DNA at differentstages in the evolution of the genetic code. The complicated features of genomic codon distributions foreukaryotes can also be simulated by the model. The features of ups and downs of peaks for eukaryotes canbe simulated by concatenating different set of codons that are generated from different subsets of the geneticcode (Fig S3b3). The alternative ”high peak - low peak” feature for some codons and the low amplitudesof the genomic codon distributions for other codons agree with the observations for true eukaryotes (Fig 3B,S3b2, S3b3).
The domains differ in the orders of the average distribution heights, which can also be simulated by thestatistical model (Fig S3c2). The GC content tends to decrease along the evolutionary direction accordingto the roadmap (Fig 1A, S2b2), which accounts for the above different orders among the domains. Differentbase compositions for G, C, A, T are assigned for Virus, Bacteria, Archaea and Eukarya in the simulations,in consideration of the decrement of GC content effected by the origin order of the domains. The simulationsof genomic codon distributions thus obtained for the domains (Fig S3b3) differ in the corresponding ordersof the average distribution heights (Fig S3c2). If high GC contents are assigned to Virus or Bacteria in the
21
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
simulation, the average distribution heights from high to low are as follows: Hierarchy 1, Hierarchy 2,Hierarchy 3, and Hierarchy 4 (Fig S3c2). If low GC contents are assigned to Archaea or Eukarya in thesimulation, the average distribution heights from high to low are as follows: Hierarchy 4, Hierarchy 3,Hierarchy 2, and Hierarchy 1 (Fig S3c2). The simulations generally agree with the observations as to theorder of the average distribution heights (Fig S3c1, S3c2).
Incidentally, the validity of the composition vector tree (a method to infer the phylogeny from genomesequences) in the literatures [23] [24] can be explained based on the genomic codon distribution in this paper(Fig S3a3). The average distribution heights for the 64 codons are proportional to the numbers of codonsin the genomes. The summations of the elements of the unnormalized genomic codon distributions for aspecies represent the numbers of the 64 codons in the genome. So a set of average distribution heights isequivalent to the K = 3 composition vector for this species. For example, the summations of the elements ofthe unnormalized genomic codon distributions of GCA and GCA are 32 and 38 respectively, which means thatthere are 33 GCA and 39 T AT in the genome (Fig S3a3). Hence, the K = 3 composition vector for this speciesis obtained, where the elements corresponding to GCA and T AT are 33 and 39 respectively (Fig S3a3). TheK = 3 composition vector (Fig S3a3) in fact corresponds to the order of the average distribution heights (FigS3c1). It has been emphasised that the genomic codon distributions are essential features of species at thesequence level. The composition vector, as a deformed method of the genomic codon distribution, is certainlyvalid to infer the phylogeny.
2.3 The evolution space and the tree of life
According to the roadmap theory and the features of genomic codon distributions, an evolution space (Fig3D) has been constructed to explain the major and minor branches of the tree of life (Fig S3d8, S3d9). Theevolution space is 3-dimensional, the coordinates of which are fluctuation, route bias and hierarchy bias (Fig3D). (i) The fluctuation is defined by the total summation of the absolute value of the neighbouring differencesof the fluctuations in the 64 genomic codon distributions for a species. (ii) The route bias is defined by thedifference of the average genomic codon distributions between Route 0 and Route 1 − 3 (Fig S3d2). (iii) Thehierarchy bias is defined by the difference of the average genomic codon distributions between Hierarchy 1−2and Hierarchy 3 − 4 (Fig S3d2).
It should be pointed out that the domains cannot easily be distinguished into clusters by the bias similarlydefined on difference of average genomic codon distributions between arbitrarily divided two groups ofcodons. So, the definitions of route bias and hierarchy bias themselves reveal the essential features of thegenetic code. After a lot of groping around, it was realised that only proper groupings of codons by theroadmap are able to distinguish the domains (Fig 3D). When defining route bias and hierarchy bias underthe major premise of clusterings of the three domains, the recruitment order of codons and the roadmapsymmetry are also in consideration (Fig S3d1, S3d2). The genome size distributions can be divided into 20groups according to the 20 combinations of the 4 bases (Fig S3d1). The number 20 of combinations is equal
22
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
to the number of canonical amino acids. And the 20 combinations generally correspond to the 20 tRNA inSubsection 1.4, among which 14 combinations correspond to certain tRNAs respectively (Fig S3d1, S2a3).So, the classifications of genome size distributions are of biological significance.
The three coordinates of the evolution space are obtained by calculations for each species, thus a speciescorresponds to a point in the evolution space (Fig 3D). The species gather together in different clusters forVirus, Bacteria, Archaea and Eukarya (Fig 3D, S3d3, S3d4, S3d5). The cluster distributions of taxa oflower rank are also observed in detail in the evolution space (Fig 3D, S3d3, S3d4, S3d5). The evolutionaryrelationships among species are thereby illustrated in the evolution space, which reveals the close relationbetween the origin of biodiversity and the origin of the genetic code. According to the fluctuations of genomiccodon distributions, the order of domains is as follows: first Virus, then Bacteria and Archaea, and lastEukarya (Fig S3d4, S3d5). Bacteria and Archaea are distinguished roughly by route bias (Fig S3d3, S3d5),and they can be distinguished better by both hierarchy bias and route bias (Fig 3D, S3d3). Diversification ofthe domains is, therefore, due to route bias and hierarchy bias at the sequence level. In addition, the leaf nodebias can be defined by the difference between the average genomic codon distributions of the leaf nodes inRoute 0−1 and that of the leaf nodes in Route 2−3 (Fig 1A, S3d2). Hence, Crenarchaeota and Euryarchaeotaare distinguished by leaf node bias (Fig S3d6, S3d7). Diversification of Crenarchaeota and Euryarchaeota is,therefore, due to leaf node bias at the sequence level. The cluster distributions of the domains in the evolutionspace shows that the biodiversity originated in the evolution of the genetic code at the sequence level. Andthe effects from the earlier stage of the evolution of the genetic code contribute more to the tree of life.
A tree of life (Fig S3d8) can be obtained based on the distances among species in the evolution space (Fig3D). The three coordinates of the evolution space need to be equally weighted by non-dimensionalization ofthe three coordinates through dividing by their standard deviations respectively first. Then a distance matrixis obtained by calculating the Euclidean distances among species in the non-dimensionalized evolution space.Thus, the tree of life is constructed by PHYLIP software based on the distance matrix (Fig S3d8). Note thatall the phylogenetic trees in this paper are constructed by PHYLIP software using UPGMA method [25].The tree of life based on the evolution space is reasonable to describe the evolutionary relationships amongspecies. Take the eukaryotes for example. Such a tree provides reasonable inclusion relationships amongvertebrates, mammals, primates, where chimpanzee is nearest to human beings. And yeast indicates its root(Fig S3d10).
A tree of taxa can also be obtained according to the average distances of species among Virus, Bacteria,Crenarchaeota, Euryarchaeota and Eukarya in the non-dimensionalized evolution space (Fig S3d9). The treeof taxa shows that the relationship between Crenarchaeota and Euryarchaeota is nearer than the relationshipbetween Euryarchaeota and Eukarya (Fig S3d9). The tree of taxa based on the evolution space agrees withthe three domain tree rather than the eocyte tree [26] [27] [28] [29] [30]. This result is accordant with theobservations that the features of genomic codon distributions between Crenarchaeota and Euryarchaeota arealso more similar than that between Euryarchaeota and Eukarya in Subsection 2.2.
23
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
Life originated in the chiral chemistry in the roadmap scenario. The basic pattern of the genetic codedetermines the main branches of the tree of life, so biodiversity is an inborn feature of life. Thus the primordialecosystem originated from the evolution of the genetic code, and the living system is a natural expansion of theprimordial ecosystem. The living system of many species can adapt to the non-chiral surroundings throughthe birth and death of diverse individuals in the ecological cycle; however a system with only one speciesnever has had chance to survive in the non-chiral surroundings. Hence the last universal common ancestorbecomes unnecessary in the theories of evolution. And the phylogenetic relationships among species oughtto form a circular network rather than a tree.
24
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
3 Reconstruction of the Phanerozoic biodiversity curve
3.1 Explanation of the trend of the Phanerozoic biodiversity curve based on genomesize evolution
Evolution has taken place and is continuing to occur at both the species level and the sequence level. Historyarchives of biodiversity on the earth have been stored not only in the fossil records at the species level [31] [32][33] but also in the genomes of contemporary species at the sequence level. A Phanerozoic biodiversity curvecan be plotted based on the fossil records [32] (Fig 4C, S4a8). The growth trend of this curve is exponential[32] (Fig 4C, S4a8), and the deviations from the exponential trend represent extinctions and originations [32](Fig 4B, S4a8). Environmental changes substantially influence the evolution of life at the species level, butdo not directly impact on the molecular evolution. Genome sizes are the most coarse-grained representationsof species at the sequence level. So, the evolution of life can be outlined by the evolution of genome size.And the genome size increased mainly via large scale duplications. Additionally considering the relativeinvariance of the genomic codon distributions in the duplications (Fig S3a4), the features of species at thesequence level do not change in general through genome size evolution. So, it is relatively independentbetween the evolution at the species level and the evolution at the sequence level. In this paper, a split andreconstruction scheme is proposed to explain the Phanerozoic biodiversity curve (Fig 4C, S4a8, S4c1): theexponential growth trend of the curve is postulated to be driven by the genome size evolution (Fig 4A), whilethe extinctions and originations in the fluctuations of the curve is postulated due to the Phanerozoic eustaticand climatic changes (Fig 4B).
It is necessary to shed light on the genome size evolution. It will be shown that the growth trend ofgenome size evolution is also exponential in general [34] [35]. The genome size distributions of species incertain taxa generally satisfy logarithmic normal distributions, according to the statistical analysis of genomesizes in the databases Animal Genome Size Database [36] and Plant DNA C-values Database [37] (Fig S4a1).The genome size distributions for the 7 higher taxa of animals and plants, namely, diploblastica, protostomia,deuterostomia, bryophyte, pteridophyte, gymnosperm and angiosperm, generally satisfy logarithmic normaldistributions, respectively. The genome size distribution for all the species in the two databases satisfieslogarithmic normal distribution very well, due to the additivity of normal distributions (Fig S4a1).
The logarithmic normal distributions of genome sizes for taxa can be simulated by a stochastic model (FigS4a2). Based on the duplication mechanism, the genome size evolution can be simulated by the stochasticmodel. The genome sizes obtained at a certain time in the simulation represent the genome sizes of speciesin a taxon at that time, which satisfy logarithmic normal distribution. The simulation indicates that, along thedirection of time, the logarithmic mean of genome sizes for a taxon GlogMean increases, and the logarithmicstandard deviation of genome sizes for a taxon GlogS D also increases (Fig S4a2). According to the simulation,the logarithmic mean of genome sizes for the ancestor of a taxon at the origin time GorilogMean can be estimatedby the genome sizes at present. Details are as follows. When genome size evolves in an exponential growth
25
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
trend, GorilogMean at the origin time is always less than GlogMean at present for a certain taxon. The greaterGlogS D is (e.g. GB
logS D > GAlogS D in Fig S4a2), the earlier the origin time is (e.g. tB
origin < tAorigin in Fig S4a2), and
thereby the less GorilogMean is (e.g. GBorilogMean < GA
orilogMean in Fig S4a2). Thus, the genome size for ancestor atthe origin time GorilogMean is obtained by calculating the genome sizes at present: GlogMean minus GlogS D timesan undetermined factor χ (Fig S4a2, S4a7).
Based on the genome sizes of contemporary species in the databases, the logarithmic means of genomesizes GlogMean and the logarithmic standard deviations of genome sizes GlogS D are obtained for the 7 highertaxa of animals and plants, respectively (Fig S4a7). The genome sizes for the ancestors of the 7 higher taxaat their origin times GorilogMean are less than the corresponding GlogMean at present, whose differences can beestimated by the corresponding GlogS D times an undetermined factor χ, respectively (Fig S4a7). The trend ofthe genome size evolution cannot be worked out just only based on the genome sizes of contemporary species.Furthermore, a chronological reference has to be introduced here. The origin times for taxa can be estimatedaccording to the fossil records. In this paper, the origin times for the 7 higher taxa are assumed as followsrespectively: diploblastica, 560 Ma; protostomia, 542 Ma (Ediacaran-Cambrian); deuterostomia, 525 Ma;bryophyte, 488.3 Ma (Cambrian-Ordovician); pteridophyte, 416.0 Ma (Silurian-Devonian); gymnosperm,359.2 Ma (Devonian-Carboniferous); angiosperm, 145.5 Ma (Jurassic-Cretaceous) [38] [39] [40]. The mainresults in this paper cannot be affected substantially by the disagreements among the origin times in theliteratures or the expansion of the databases. A rough exponential growth trend of genome size evolutionis observed in the plot whose abscissa and ordinate represent the origin time and GlogMean respectively (FigS4a7), where the differences between the genome sizes at the origin time and that at present are not consideredyet. A more reasonable plot is introduced by taking GorilogMean rather than GlogMean as the ordinate (Fig S4a7).In the calculation, GorilogMean are obtained respectively for the 7 higher taxa by (Fig S4a7)
GorilogMean = GlogMean − χ ·GlogS D.
According to the regression analysis, a regression line approximates the relationship between the origin timeand GorilogMean (Fig S4a7). Let the intercept of the regression line be the logarithmic mean of genome sizesfor all the species in the databases, then the undetermined factor is χ = 1.57 (Fig S4a7). The slope of theregression line represents the exponential growth speed of genome size evolution, which was doubled foreach 177.8 Ma (Fig S4a7, 4A). It does not violate common sense that the genome size at the time of theorigin of life about 3800 Ma is about several hundreds base pairs, according to the regression line.
There are more evidences to support the method for calculating the ancestors’ genome size based on thedata of contemporary species. The values of GlogMean, GlogS D and GorilogMean are obtained for the 19 animaltaxa or for the 53 angiosperm taxa, respectively (Fig S4a3). The numbers of species in these taxa are greatenough for statistical analysis. The chronological order for the 19 animal taxa is obtained by comparingGorilogMean (Fig S4a5). The order of the three superphyla is suggested to be: 1 diploblastica, 2 protostomia, 3deuterostomia, which can be inferred from the chronological order of the 19 animal taxa and the superphylumaffiliation of the 19 taxa (Fig S4a5). This result supports the three-stage pattern in Metazoan origination basedon fossil records [41] [42] [43] [44] [45] [46] [47] [48]. The chronological order for the 53 angiosperm taxa
26
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
is also obtained by comparing GorilogMean (Fig S4a6), which confirms the common sense that dicotyledoneaeoriginated earlier than monocotyledoneae. Let the abscissa and ordinate represent respectively GorilogMean
and GlogMean for the 7 higher taxa and 19 + 53 taxa, where the variation ranges of logarithm of genomesizes GlogMean ± GlogS D were also indicated on the ordinate. The upper triangular distribution of the genomesize variations thus obtained (Fig S4a4) reveals intrinsic features of the genome size evolution. The greaterGorilogMean is, the greater GlogMean is, and the less GlogS D is (Fig S4a4). The upper vertex of the triangle indicatesapproximately the upper limit of genome sizes of contemporary species. The upper triangular distribution(Fig S4a4) can be explained by the simulation of genome size evolution (Fig S4a2). The simulation showsthat GorilogMean and GlogMean increase along the evolutionary direction; the earlier the time of origination is,the greater GlogS D is (Fig S4a2, S4a4).
The exponential growth trend of Phanerozoic biodiversity curve can be explained by the exponentialgrowth trend of genome size evolution. Bambach et al. obtained the Phanerozoic biodiversity curve basedon fossil records, which satisfies the exponential growth trend (Fig 4C). The number of genera was doubledfor each 172.7 Ma in the Phanerozoic eon (Fig 4A). The genome size was doubled for each 177.8 Ma in thePhanerozoic eon (Fig 4A). The exponential growth speed for the number of genera agrees with the exponentialgrowth speed for the genome size. It is conjectured that the genome size evolution at the sequence level drivesthe evolution of life at the species level.
Note that Bambach et al.’s curve is chosen as the Phanerozoic biodiversity curve in this paper. It is basedon the following considerations. Based on fossil records, Sepkoski obtained the Phanerozoic biodiversitycurve at family level for the marine life first [49], then obtained the curve at genus level [50]. Sepkoski did notinclude single fossil records in order to reduce the impact on the curve at genus level from the samplings andtime intervals. In order to reduce such an impact furthermore, Bambach et al. chose only boundary-crossingtaxa in construction of the Phanerozoic biodiversity curve at genus level [32]. Rohde and Muller obtained acurve for all genera and another curve for well-resolved genera (removing genera whose time intervals areepoch or period, or single observed genera) respectively based on Sepkoskis Compendium, which can roughlybe taken as the upper limit and lower limit of the Phanerozoic curve (Fig 4C). These Phanerozoic biodiversitycurves are roughly in common in their variations, from which the five mass extinctions O-S, F-F, P-Tr, Tr-J,K-Pg can be discerned. There are several models to explain the growth trend of the Phanerozoic biodiversitycurve. Sepkoski and other co-workers tried to explain the growth trend by 2-dimensional Logistic model [51][52] [53]. Benton et al. explained it by exponential growth model [54]. The Bambach et al.’s curve is lowerin the Paleozoic era, and obviously higher in the Cenozoic era than Sepkoski’s curve. It is more obvious thatthe Bambach et al.’s curve is in exponential growth trend than Sepkoski’s curve.
The exponential growth trend and the net fluctuations of the Bambach et al.’s biodiversity curve can beexplained at both the molecular level and the species level respectively by a split and reconstruction scheme(Fig S4a8). In this scheme, the exponential growth trend of biodiversity is attributed to the driving force fromthe genome size evolution, in view of the approximate equality between the growth speed of number of generaand the growth speed of genome size (Fig 4A). The homochirality of life, the genetic code and the tree of life
27
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
are generally invariant with the duplications of genome sizes. It maintains the features of life at the sequencelevel. Expansion of genome size provides conditions at the sequence level for the innovation of biodiversityat the species level. According to the split and reconstruction scheme, the extinctions and originations bythe deviations from the exponential growth trend of the biodiversity curve is interpreted as a biodiversitynet fluctuation curve. An unnormalised biodiversity net fluctuation is obtained by subtracting the growthtrend from the logarithmic Phanerozoic biodiversity curve (Fig S4a8), and the biodiversity net fluctuationcurve (Fig 4B, S4b5) is defined by dividing its standard deviation from the biodiversity net fluctuation curve.The biodiversity net fluctuation curve can be explained by the Phanerozoic eustatic curve and Phanerozoicclimatic curve at the species level in the next subsection (Fig 4B).
3.2 Explanation of the variation of the Phanerozoic biodiversity curve based onclimatic and eustatic data
According to the split and reconstruction scheme for the Phanerozoic biodiversity curve, the growth trend ofbiodiversity is attributed to the genome size evolution at the sequence level, and the net fluctuations deviatedfrom the growth trend is attributed to the couplings among earth’s spheres. The history of biodiversity in thePhanerozoic eon can be outlined by the biodiversity net fluctuation curve (Fig 4B): the curve reached the highpoint after the Cambrian radiation and the Ordovician radiation; then the curve reached the low point after atriple plunge of the O-S, F-F and P-Tr mass extinctions; and afterwards the curve climbed higher and higherexcept for mass extinctions and recoveries around Tr-J and K-Pg. The biodiversity net fluctuation curve canbe reconstructed through a climato-eustatic curve based on climatic and eustatic data (Fig 4B), which is freefrom the fossil records. Accordingly, it is advised that the tectonic movement can account for the five massextinctions.
Numerous explanations on the cause of mass extinctions have been proposed, such as climate change,marine transgression and regression, celestial collision, large igneous province, water anoxia, ocean acidification,release of methane hydrates and so on. According to a thorough study of control factors on mass extinctions,it is almost impossible to explain all the five mass extinctions by a single control factor. Although the O-Smass extinction may be attributed to the glacier age, the other mass extinctions almost happened in hightemperature periods. The other glacier ages in the Phanerozoic eon had always avoided the mass extinctions.Although the mass extinctions tend to happen with marine regressions, it is commonly believed that no marineregression happened in the end-Changhsingian, the greatest mass extinction [55] [56] [57] [58] [59]. Largeigneous province may play roles only in the last three mass extinctions [60] [61]. The K-Pg mass extinctionwas most likely to be caused by celestial collision, but there are still significant doubts on this explanation[62] [63] [64]. According to the high resolution study of the mass extinctions based on fossil records, moreunderstandings on the complexity of mass extinctions arise. For example, the P-Tr mass extinction can bedivided into two phases: End-Guadalupian and end-Changhsingian [65] [66] [67] [68] [69]; furthermore theend-Changhsingian stage can be subdivided into main stage (B line) and final stage (C line) [70] [71], andthe course of extinction may happened very rapidly [72] [73] [74] [75] [76]. Such a complex pattern of the
28
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
mass extinction challenges any explanations on this problem. it is also almost impossible to attribute a massextinction to a single control factor. Based on the above discussions, it is wise to consider a multifactorialexplanation for the mass extinctions. In consideration of the interactions among the lithosphere, hydrosphere,atmosphere and biosphere, the tectonic movement of the aggregation and dispersal of Pangaea influenced thebiodiversity net fluctuation curve at several time scales through the impacts of hydrosphere and atmosphere(Fig 4B).
Before explaining the mass extinctions based on the tectonic movement, two preparations are required:first, to construct a Phanerozoic eustatic curve; second, to construct a Phanerozoic climatic curve.
The Phanerozoic eustatic curve is based on the following two results (Fig S4b3): (i) Haq sea level curve[77] [78], (ii) Hallam sea level curve [79]. In 1970s, Seismic stratigraphy developed a method called sequencestratigraphy to plot the sea level fluctuations. Vail in the Exxon Production Research Company obtained asea level curve based on this method, which concerns controversial due to business confidential data. Haqet al. replotted a sea level curve for the mesozoic era and cenozoic era also based on this method [77], anda sea level curve for the paleozoic era [78]. The Haq sea level curve here is obtained by combining thesetwo curves (Fig S4b3). Hallam obtained a sea level curve by investigating large amounts of data on sea levelvariations [79] (Fig S4b3). Generally speaking, it is not quite different between the Haq sea level curve andthe Hallam sea level curve. Therefore, a Phanerozoic eustatic curve can be obtained by averaging the Haqsea level curve and the Hallam sea level curve after nondimensionalization at equal weights (Fig S4b3).
The Phanerozoic climatic curve is based on the following four results (Fig S4b2): (i) the climatic gradientcurve based on climate indicators in the Phanerozoic eon [80] [81] (Fig S4b1); (ii) the atmospheric CO2
fluctuations in the Berner’s models that emphasise the weathering role of tracheophytes [82] [83] [84]; (iii)the marine carbonate 87S r/86S r record in the Phanerozoic eon in Raymo’s model [85] [86]. The more positive87S r/86S r values, assumed to indicate enhanced CO2 levels and globally colder climates; more negative87S r/86S r values, indicating enhanced sea-floor speeding and hydrothermal activity, are associated withincreased CO2 levels; (iv) the marine δ18O record in the Phanerozoic eon based on thermodynamic isotopeeffect [86] [87]. The above four independent results are quite different to one another (Fig S4b2). Consideringthe complexity of the problem on the Phanerozoic climate change yet, all the four methods are reasonablein principle on their grounds respectively. A consensus Phanerozoic climatic curve is obtained based onthe above four independent results after nondimensionalization at equal weights. Namely, the Phanerozoicclimatic curve is defined as the average of nondimensionalized climate indicator curve [80] [81], Berner’sCO2 curve [82], 87S r/86S r curve [85] [86] and δ18O curve [86] [87] (Fig S4b2).
Nondimensionalization is a practical and reasonable method to obtain the consensus Phanerozoic climaticcurve. As the temperature in a certain time is concerned, when the four results agree to one other, thecommon temperatures are adopted; when the four results disagree, the majority consensus temperatures areadopted (Fig S4b2). According to the Phanerozoic climatic curve, the temperature was high from Cambrianto Middle Ordovician (Fig S4b2); the temperature was low during O-S (Fig S4b2); the temperature increases
29
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
from Silurian to Devonian, and it reach high during F-F (Fig S4b2); the temperature is extreme low duringD-C (Fig S4b2); the temperature rebounded slightly during Missisippian, and the temperature became loweragain during Pennsylvanian, and the temperature decreased to extreme low during early Permian (Fig S4b2);the temperature increase in Permian, and especially the temperature increase rapidly during the Lopingian,and reach an extreme high peak during P-Tr (Fig S4b2); the temperature is high from Lower Triassic toLower Jurassic, and it reached to extreme high during Lower Jurassic (Fig S4b2); the temperature becamelow from Middle Jurassic to Lower Cretaceous, where it is coldest in the Mesozoic era during J-K, whichwas not low enough to form glaciation (Fig S4b2); the temperature is high again from Upper Cretaceous toPaleocene, and the detailed high temperature during the Paleocene-Eocene event is observed (Fig S4b2); thetemperature decreased from Eocene to present, especially reached to the minimum in the Phanerozoic eonduring Quaternary (Fig S4b2). Four glacier ages are observed in the Phanerozoic eon (Fig S4b2): Hirnantianglacier age, Tournaisian glacier age, early Permian glacier age and Quaternary glacier age, especially bipolarice sheet at present. The Phanerozoic climatic curve also indicates the extreme low temperature during J-K(Fig S4b2), which corresponds to possible continental glaciers at high latitudes in the southern hemisphere[88]. In short, the fine-resolution Phanerozoic climatic curve agrees with the common sense on climatechange at the tectonic time scale.
The Phanerozoic climatic curve is taken as a standard curve to evaluate the four independent methods.The climate indicator curve is the best result among the four. It accurately reflects the four high temperatureperiods and four low temperature periods, and the varying amplitudes of the curve are relatively modestand reasonable. But it overestimated the temperature during Pennsylvanian (Fig S4b2). The δ18O curve isalso reasonable. It also reflects the four high temperature periods and four low temperature periods. But itoverestimated the temperature during Guadalupian, and underestimated the temperature during Jurassic (FigS4b2). The 87S r/86S r curve is reasonable to some extent. It reflects the high temperature during Mesozoicand low temperature during Carboniferous and Quaternary. But it underestimated the temperature duringCambrian, and overestimated the temperature during O-S and J-K (Fig S4b2). The Berner’s CO2 curve isrelatively rough. It reflects the high temperature during Mesozoic, But it overestimated the temperature fromCambrian to Devonian (Fig S4b2).
As the climate changes at the tectonic time scale is concerned, the Berner’s CO2 curve indicates relativelylong term climate changes; the climate indicator curve indicates relatively mid term climate changes; andthe 87S r/86S r and δ18O curves indicate relatively short term climate changes (Fig S4b2). The scale of thevariation of the Phanerozoic climatic curve is relatively accurate because this consensus climatic curve isobtained by combining all the different time scales of the above four climatic curves. Four high temperatureperiods Cambrian, Devonian, Triassic, K-Pg and four low temperature periods O-S, Carboniferous, J-K,Quaternary appeared alternatively in the Phanerozoic climatic curve, which generally reflect four periods inthe climate changes in the Phanerozoic eon (Fig S4b2). Such a time scale of the four periods of climatechanges is about equal to the average interval of mass extinctions, and is also equal to the time scale of thefour periods of the variation of the relative abundance of microbial carbonates in reef during Phanerozoic [89][90] [91] [92]. The agreement of the time scales indicates the relationship between the climate change and
30
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
the fluctuations of biodiversity in the Phanerozoic eon.
The sea level fluctuations and the climate change have an important impact on the fluctuations of biodiversity.With respect to the impact of the sea level fluctuations, this paper follows the traditional view that sea levelfluctuations play positive contributions to the fluctuations of biodiversity. With respect to the impact ofthe climate change, this paper believes that climate change plays negative contributions to the fluctuationsof biodiversity. The climate gradient curve is opposite to the climate curve (Fig S4b2). The higher thetemperature is, the lower the climate gradient is, and consequently the less the global climate zones are; whilethe lower the temperature is, the higher the climate gradient is, and consequently the more the global climatezones are. When the climate gradient is high, there were tropical, arid, temperate and frigid zones among theglobal climate zones, which boosts the biodiversity. So, this paper believes that the biodiversity increases withthe climate gradient. This point of view agrees with the following observations. The biodiversity reboundedduring the low temperature period Carboniferous (Fig S4b5), and increased steadily during the ice age duringQuaternary (Fig S4b5); and the four mass extinctions F-F, P-Tr, Tr-J, K-Pg appeared in high temperatureperiods (Fig S4b5). Although the O-S mass extinction occurred in the ice age (Fig S4b5), this case is quitedifferent from other mass extinctions. In the first mass extinction during O-S, mainly the Cambrian fauna,which originated in the high temperature period, cannot survive from the O-S ice age. Mass extinctions nolonger occurred thereafter in the ice ages for those descendants who have successfully survived from the O-Sice age (Fig S4b5).
Based on the above analysis, the eustatic curve and the climatic gradient curve is positively related to thefluctuations of biodiversity in the Phanerozoic eon. Accordingly, a climato-euctatic curve is defined by theaverage of the nondimensionaliezed eustatic curve and the nondimensionaliezed climatic grandaunt curve,which indicates the impact of environment on biodiversity (Fig S4b4). It should be emphasised that theclimato-euctatic curve is by no means based on fossil records, which is only based on climatic and eustaticdata. It is obvious that the climato-euctatic curve agrees with the biodiversity net fluctuation curve (Fig4B). Judging from the overall features, the outline of the history of the Phanerozoic biodiversity can bereconstructed in the climato-eustatic curve, as far as the Cambrian radiation and the Ordovician radiation, thePaleozoic diversity plateau, the P-Tr mass extinction, and the Mesozoic and Cenozoic radiation are concerned(Fig 4B). Judging from the mass extinctions, the five mass extinctions O-S, F-F, P-Tr, Tr-J, K-Pg can bediscerned from the five steep descents in the climato-eustatic curve respectively (Fig 4B, S4b7). As to K-Pgespecially, there is obviously a pair of descent and ascent around K-Pg in the climato-eustatic curve (Fig4B), so celestial explanations are actually not necessary here. Judging from the minor extinctions, mostof the minor extinctions can be discerned in the climato-eustatic curve in details, as far as the following areconcerned: 3 extinctions during Cambrian, Cm-O extinction, S-D extinction, 2 extinctions during Carboniferous,1 extinction during Permian, 1 extinction during Triassic, 3 extinctions during Jurassic, 2 extinctions duringCretaceous, 2 extinctions during Paleogene, 1 extinction during Neogene and the extinction at present (Fig4B, S4b7). Judging from the increment of the biodiversity, the originations and radiations can be discernedfrom the ascents in the climato-eustatic curve, as far as the following are concerned: the Cambrian radiation,the Ordovician radiation, the Silurian radiation, the Lower Devonian diversification, the Carboniferous rebound,
31
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
the Triassic rebound, the Lower Jurassic rebound, the Middle Cretaceous diversification, the Paleogenerebound (Fig 4B, S4b7). The derivative of the climato-eustatic curve indicates the variation rates (Fig 4B,S4b7). The extinctions are discerned by negative derivative, and the rebounds and radiations by positivederivative. It is obvious to discern the five mass extinctions, most of minor extinctions, and rebounds andradiations in the derivative curve of the climato-eustatic curve, just like in the climato-eustatic curve (Fig 4B,S4b7).
The interactions among the lithosphere, hydrosphere, atmosphere, biosphere are illustrated by the eustaticcurve, climatic curve, biodiversity net fluctuation curve and the aggregation and dispersal of Pangaea (FigS4b5, S4b6). There were about two long periods in the eustatic curve, which were divided by the event of themost aggregation of Pangaea (Fig S4b5). The first period comprises the grand marine transgression from theCambrian to the Ordovician and the grand marine regression from the Silurian to the Permian, and the secondperiod comprises the grand marine transgression during Mesozoic and the grand regression during Cenozoic(Fig S4b5). These observations reveal the tectonic cause of the grand marine transgressions and regressionsin the Phanerozoic eon, considering the aggregation and dispersal of Pangaea. The coupling between thehydrosphere and the atmosphere can be comprehended by comparing the eustatic curve and the climatic curve.The global temperature was high during both the grand marine transgressions (Fig S4b5). And all the four iceages in the Phanerozoic occurred during the grand marine regressions (Fig S4b5). Generally speaking, theclimate gradient curve varies roughly synchronously with the eustatic curve from the Cambrian to the LowerCretaceous along with the aggregation of Pangaea, except for the regression around the Tournaisian ice ages(Fig S4b5). And the climate gradient curve varies roughly oppositely with the eustatic curve thereafter fromthe Upper Cretaceous to the present along with the dispersal of Pangaea (Fig S4b5); namely the climategradient curve increases as the eustatic curve decreases. These observations reveal the tectonic cause of theclimate change at tectonic time scale in the Phanerozoic eon, considering the impact on the climate from thegrand transgressions and regressions. The biodiversity responses almost instantly to both the eustatic curveand the climatic curve, because the life spans of individuals of any species are much less than the tectonic timescale. Hence, the biodiversity net fluctuation curve agrees with the climato-eustatic curve very well (Fig 4B).In conclusion, the tectonic movement has substantially influenced the biodiversity through the interactionsamong the earth’s spheres (Fig S4b5, S4b6).
The interactions among the earth’s spheres can explain the outline of the biodiversity fluctuations in thePhanerozoic eon. During Cambrian and Ordovician, both the sea level and the climate gradient increased, sothe biodiversity increased (Fig S4b5). From Silurian to Permian, both the sea level and the climate gradientdecreased, so the biodiversity decreased (Fig S4b5), except for the biodiversity plateau during Carboniferousdue to the opposite increasing climate gradient and decreasing sea level (Fig S4b5). From Triassic to LowerCretaceous, both the sea level and the climate gradient increased, so the biodiversity increased (Fig S4b5).From Upper Cretaceous to present, the sea level decreased and the climate gradient increased rapidly, so thebiodiversity increased (Fig S4b5).
The interactions among the earth’s spheres can explain the five mass extinctions within the same theoretical
32
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
framework, though their causes are different in detail. The mass extinctions O-S, K-Pg occurred around thepeaks in the eustatic curve respectively, and the P-Tr mass extinction around the lowest point in the eustaticcurve. There were about two periods in the eustatic curve and about four periods in the climatic curve in thePhanerozoic eon. So the period of climate change was about a half of the period of sea level fluctuations, as faras their changes at the tectonic time scale are concerned (Fig S4b5). By comparison, the F-F mass extinctionoccurred in the low level of climate gradient. There was a triple plunge from the high level of biodiversityduring the end-Ordovician to the extreme low level of biodiversity during Triassic in the biodiversity netfluctuation curve, which comprised a series of mass extinctions O-S, F-F, P-Tr (Fig 4B). It is interesting that arelatively less great extinction may occur after each of the triple plunge mass extinctions respectively: namelythe S-D extinction after the O-S mass extinction, the Carboniferous extinction after the F-F mass extinction,and the Tr-J mass extinction after the P-Tr mass extinction (Fig 4B). Especially during the end of Permian,the aggregation of the Pangaea pull down both the eustatic curve and the climate gradient curve, and hencepull down the biodiversity curve (Fig S4b5). After a long continual descent in the biodiversity net fluctuationcurve and after the two mass extinctions O-S, F-F, the ecosystem was more fragile at the end of Permian thanin the previous period when biodiversity was at the high level, therefore the P-Tr mass extinction occurred asthe most severe extinction event in the Phanerozoic eon (Fig 4B). The P-Tr mass extinction can be dividedinto the first phase at the end of Guadalupian and the second phase at the end of Changhsingian (Fig 4B).The extinction of Fusulinina mainly occurred at the end of Guadalupian; while the extinction of Endothyrinamainly occurred at the end of Changhsingian. Note that the number of Endothyrina did not decrease obviouslyat the first phase. Different extinction patterns in the above two phases indicate that the triggers for the twophases are different to each other. Marine regression occurred at the end of the Guadalupian, whereas itis indicated that no marine regression occurred at the end of the Changhsingian. The second phase of theend-Changhsingian extinction may be caused by the declining climate gradient (or increasing temperature).All the four independent methods in the above (climatic-sensitive sediments, Berner’s CO2, 87S r/86S r andδ18O) predict the rapidly increasing temperature at the end of the Changhsingian (Fig S4b2). Either Siberiatrap or the methane hydrates can raise the temperature at the end of the Changhsingian, which may trigger theP-Tr mass extinction. Four mass extinctions F-F, P-Tr, Tr-J, K-Pg and the Cm-O extinction occurred duringthe low level of climate gradient (Fig S4b5); as a special case, the O-S mass extinction occurred during thehigh level of climate gradient (Fig S4b5). Taken together, these observations show that low climate gradientcan contribute to the mass extinctions substantially. Generally speaking, the mass extinctions might occurcoincidentally when brought together two or more factors among the low climate gradient, marine regression,and the like (Fig S4b5).
The interactions among the earth’s spheres can explain the minor extinctions, radiations and rebounds.The minor descents in the biodiversity net fluctuation curve relate to mid-term or short-term changes insea level and climate at the tectonic time scale, which correspond approximately to the minor extinctionsrespectively [93] [94] (Fig 4B, S4b7). It is shown that most minor extinctions in the Paleozoic era correspondedto low climate gradient, and most minor extinctions during the Mesozoic and Cenozoic corresponded tomarine regressions (Fig 4B, S4b7). After mass extinctions, both the climate gradient and sea level tended to
33
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
increase; henceforth the radiations and rebounds occurred. Especially for the case of the P-Tr mass extinction,the climate gradient remained low throughout the Lower Triassic (Fig 4B, S4b5), so it took a long time forthe recovery from the P-Tr mass extinction.
3.3 Reconstruction of the Phanerozoic biodiversity curve and the strategy of theevolution of life
Now, let’s reconstruct the Phanerozoic biodiversity curve. In Subsection 3.1, the exponential growth trend ofthe Phanerozoic biodiversity curve has been explained by the exponential growth trend of the genome sizeevolution (Fig 4A); in Subsection 3.2, the deviation from the exponential growth trend of the Phanerozoicbiodiversity curve has been explained by the climato-eustatic curve (Fig 4B). By combining the exponentialgrowth trend of the genome size evolution and the climato-eustatic curve, the biodiversity curve based onfossil records can be reconstructed based on the genomic, climatic and eustatic data (Fig 4C). The techniquedetails are as follows (Fig S4c1). The reconstructed biodiversity net fluctuation curve is obtained based on theclimato-eustatic curve times a certain weight, where the weight is the standard deviation of the biodiversitynet fluctuation curve (Fig S4c1). And the slope and the intercept of the line that corresponds to the exponentialgrowth trend of the reconstructed biodiversity is obtained as follows. Its slope is defined by the slope of theline that corresponds to the exponential growth trend of genome size evolution (Fig S4c1), and its interceptis defined by the intercept of the line that corresponds to the exponential growth trend of the biodiversitycurve (Fig S4c1). The reconstructed biodiversity curve thus obtained agrees nicely with the Bambach et al.’sbiodiversity curve within the error range of fossil records (Fig 4C).
The above split and reconstruction scheme of the Phanerozoic biodiversity curve can explain not onlythe exponential growth trend and the detailed fluctuations, but also the roughly synchronously decliningorigination rate and extinction rate throughout the Phanerozoic eon [95] [96] [97] [98] [99] (Fig 4D, S4d1,S4d2, S4d3). Ramp and Sepkoski first found the declining extinction rate throughout the Phanerozoic eon[100] [101]. And Sepkoski also found the declining origination rate throughout the Phanerozoic eon [94].The numbers of origination genera and extinction genera are marked in the Bambach et al.’s biodiversitycurve. The origination rate and extinction rate are obtained from the numbers of origination or extinctiongenera divided by the corresponding time intervals (Fig S4d3). The difference between the origination rateand extinction rate is the variation rate of number of genera (Fig S4d3). It is shown that both the originationrate and extinction rate decline roughly synchronously throughout the Phanerozoic eon [102] (Fig S4d3).
The rough synchronicity between the origination rate and the extinction rate can be explained according tothe split and reconstruction scheme. The Phanerozoic biodiversity curve can be divided into two independentparts: an exponential growth trend and a biodiversity net fluctuation curve (Fig S4a8, S4c1). Notice thatthere is no declining trend in the climato-eustatic curve, the expectation value of which is constant (Fig 4B,S4d1); whereas the biodiversity curve was continuously boosted upwards by the exponential growth trendof genome size evolution (Fig 4A). The expectation value of the difference between the origination rate
34
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
and extinction rate corresponds to the exponential growth trend of biodiversity boosted by the genome sizeevolution (Fig S4d1, S4d2). If too large or too small origination rate brings the biodiversity to deviate fromthe exponential growth trend, the extinction rate has to response by increasing or decreasing so as to maintainthe constant exponential growth trend of the biodiversity (Fig S4d2, S4d3). The rough synchronicity betweenthe origination rate and the extinction rate shows that the growth trend of biodiversity at the species level wasconstrained strictly by the exponential growth trend of genome size evolution at the sequence level.
The declining trends of the origination rate and the extinction rate can be explained according to the splitand reconstruction scheme. It is necessary to assume that there is an upper limit of ability to create newspecies in the living system, which is due to the constraint of the earth’s environment at the species leveland the limited mutation rate at the sequence level (Fig S4d1). Although there is no declining trend forthe variation rate of the biodiversity net fluctuation curve (Fig S4b7, S4d1), the trend of the variation rateof the Phanerozoic biodiversity curve tends to decline in general, because the exponential growth trend ingenome size evolution is added in the denominator in the calculations (Fig 4D, S4d2, S4d3). The decliningvariation rate throughout the Phanerozoic eon shows that the genome size evolution played a primary role inthe growth of biodiversity. The exponential growth trend of biodiversity driven by the genome size evolutioncan gradually weaken the impact on the biodiversity fluctuations from the changes of sea level and climate(Fig 4D, S4d2). The rough synchronicity between the origination rate and the extinction rate can explainboth the declining origination rate and the declining extinction rate throughout the Phanerozoic eon (FigS4d2, S4d3).
The biodiversity evolves independently at the sequence level and at the species level. This ensuresthe adaptation of life to the earth’s changing environment. The essential feature of the living system isits homochirality. At the beginning of life, the homochirality, the genetic code, and the genomic codondistributions had established. The homochirality ensures the safety for the living system: the earth’s changingenvironment influenced the living system only at the species level, which cannot directly threaten the existenceof life at the sequence level. The strategy of adaptation is that: life generates numerous redundant speciesat the sequence level; thus life sacrificed opportunity of survival for the vanished majority of the species inexchange for the present outstanding adaptability to the changing environment. By virtue of this strategy ofadaptation, the chiral living system may adapt to a wide range of environmental changes on planets.
35
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
References
[1] V. N. Soyfer, V. N. Potaman, Triple-Helical Nucleic Acids (Springer-Verlag, New York, 1996).
[2] B. P. Belotserkovskii et al., Formation of intramolecular triplex in homopurine-homopyrimidine mirror repeatswith point substitutions. Nucleic Acids Research 18, 6621-6624 (1990).
[3] J. T. Wong, Coevolution theory of the genetic code at age thirty. BioEssays 27, 416-425 (2005).
[4] E. N. Trifonov et al., Primordia vita. deconvolution from modern sequences. Orig Life Evol Biosph 36, 559-565(2006).
[5] M. Di Giulio, On the origin of the transfer RNA molecule. J. Theor. Biol. 159, 199-214 (1992).
[6] M. Di Giulio, Was it an ancient gene codifying for a hairpin RNA that, by means of direct duplication, gave riseto the primitive tRNA molecule? J. Theor. Biol. 177, 95-101 (1995).
[7] M. Di Giulio, The nonmonophyletic origin of tRNA molecule. J. Theor. Biol. 197, 403-414 (1999).
[8] M. Di Giulio, The origin of the tRNA molecule: Implications for the origin of protein synthesis. J. Theor. Biol.226, 89-93 (2004).
[9] M. Di Giulio, Nanoarchaeum equitans is a living fossil. J. Theor. Biol. 242, 257-260 (2006).
[10] R. F. Gesteland et al. eds., The RNA World (Cold Spring Harbor Laboratory, New York, 2006), Third Edition.
[11] G. Eriani et al., Partition of tRNA synthetases into two classes based on mutually exclusive sets of sequencemotifs. Nature 347, 203-206 (1990).
[12] Y. Nambu and G. Jona-Lasinio, Dynamical Model of Elementary Particles Based on an Analogy withSuperconductivity. I. Phys. Rev. 122, 345-358 (1961).
[13] Y. Nambu and G. Jona-Lasinio, Dynamical Model of Elementary Particles Based on an Analogy withSuperconductivity. II. Phys. Rev. 124, 246-254 (1961).
[14] A. Muto, S. Osawa, The guanine and cytosine content of genomic DNA and bacterial evolution. Proc. Natl.Acad. Sci. U.S.A. 84, 166-169 (1987).
[15] A. Gorban, Codon usage trajectories and 7-cluster structure of 143 complete bacterial genomic sequences.Physica A 353, 365-387 (2005).
[16] E. N. Trifonov et al., Distinc stage of protein evolution as suggested by protein sequence analysis. J. Mol. Evol.53, 394-401 (2001).
[17] E. N. Trifonov, The triplet code from first principles. Journal of Biomolecular Structure & Dynamics 22:1 (2004).
[18] I. S. Povolotskaya et al., Stop codons in bacteria are not selectively equivalent. Biology Direct 7:30 (2012).
[19] J. T. Wong, A coevolution theory of the genetic code. Proc. Natl. Acad. Sci. U.S.A. 72, 1909-1912 (1975).
[20] J. T.-F. Wong, A. Lazcano, Prebiotic Evolution and Astrobiology (Landes Bioscience, Austin Texas, 2009).
[21] D. J. Li, S. Zhang, Genetic code evolution as an initial driving force for molecular evolution. Physica A 388,3809-3825 (2009).
[22] E. N. Trifonov, Consensus temporal order of amino acids and evolution of the triplet code. Gene 261, 139-151(2000).
36
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
[23] F. Delsuc et al., Phylogenomics and the reconstruction of the tree of life. Nat. Rev. Genet. 6, 361-375 (2005).
[24] J. Qi et al., Whole genome prokaryote phylogeny without sequence alignment: A K-string composition approach.J. Mol. Evol. 58, 1-11 (2004).
[25] J. Felsenstein, Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. 17,368-376 (1981).
[26] C. R. Woese et al., Towards a natural system of organisms: Proposal for the domains Archaea, Bacteria, andEucarya. Proc. Natl. Acad. Sci. U.S.A. 87, 4576-4579 (1990).
[27] J. A. Lake et al., Eocytes: A new ribosome structure indicates a kingdom with a close relationship to eukaryotes.Proc. Natl. Acad. Sci. U.S.A. 81, 3786-3790 (1984).
[28] J. A. Lake, Origin of the eukaryotic nucleus determined by rate-invariant analysis of rRNA sequences. Nature331, 184-186 (1988).
[29] J. M. Archibald, The eocyte hypothesis and the origin of eukaryotic cells. Proc. Natl. Acad. Sci. U.S.A. 105,20049-20050 (2008).
[30] C. J. Cox et al., The archaebacterial origin of eukaryotes. Proc. Natl. Acad. Sci. U.S.A. 105, 20356-20361 (2008).
[31] J. J. Sepkoski, Jr., A compendium of fossil marine animal genera. Bulletins of American Paleontology No. 363(2002).
[32] R. K. Bambach et al., Origination, extinction, and mass depletions of marine diversity. Paleobiology 30, 522-542(2004).
[33] R. A. Rohde, R. A. Muller, Cycles in fossil diversity. Nature 434, 208-210 (2005).
[34] A. A. Sharov, Genome increases as a clock for the origin and evolution of life. Biology Direct 1, 17 (2007).
[35] D. J. Li, S. Zhang, The Cambrian explosion triggered by critical turning point in genome size evolution.Biochemical and Biophysical Research Communications 392, 240-245 (2010).
[36] T. R. Gregory, Animal Genome Size Database, http://www.genomesize.com.
[37] M. D. Bennett, I. J. Leitch, Plant DNA C-values database (release 5.0, Dec. 2010), http://www.kew.org/cvalues/.
[38] K. J. Willis, J. C. McElwain, The Evolution of Plants (Oxford Univ. Press, Oxford, 2014).
[39] M. R. Gibling, N. S. Davies, Palaeozoic landscapes shaped by plant evolution. Nature Geoscience 5, 99-105(2012).
[40] T. L. P. Couvreur, Early evolutionary history of the flowering plant family Annonaceae: steady diversificationand boreotropical geodispersal. Journal of Biogeography 38, 664-80 (2011).
[41] D. Shu, Cambrian explosion: Birth of tree of animals. Gondwana Research 14, 219-240 (2008).
[42] S. Conway-Morris, The fossil record and the early evolution of the metazoa. Nature 361, 219-225 (1993).
[43] S. Conway-Morris, The Burgess shale fauna and the Cambrian explosion. Science, 339-346 (1989).
[44] G. Budd, S. Jensen, A critical reappraisal of the fossil record of the bilaterian phyla. Biological Review 75,253-295 (2000).
[45] D. Shu, On the phylum vetulicolia. Chinese Science Bulletin 50, 2342-2354 (2005).
37
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
[46] D. Shu et al., Ancestral echinoderms from the Chengjiang deposits of China. Nature 430, 422-428 (2004).
[47] J. W. Valentine, How were vendobiont bodies patterned? Palaeobiology 27, 425-428 (2001).
[48] D. Shu et al., Restudy of cambrian explosion and formation of animal tree. ACTA Palaeontologica Sinica 48,414-427 (2009).
[49] J. J. Sepkoski, Jr., A factor analytic description of the Phanerozoic marine fossil record. Paleobiology 7, 36-53(1981).
[50] J. J. Sepkoski, Jr., Biodiversity: Past, present, and future. Journal of Paleontology 71, 533-539 (1997).
[51] J. J. Sepkoski, Jr., A kinetic model of Phanerozoic taxonomic diversity I, Analysis of marine orders. Paleobiology4, 223-251 (1978).
[52] J. J. Sepkoski, Jr., A kinetic model of Phanerozoic taxonomic diversity II, Early Phanerozoic families andmultiple equilibria. Paleobiology 5, 222-251 (1979).
[53] J. J. Sepkoski, Jr., A kinetic model of Phanerozoic taxonomic diversity III, Post-Paleozoic families and multipleequilibria. Paleobiology 10, 246-267 (1984).
[54] D. Hewzulla et al., Evolutionary patterns from mass originations and mass extinctions. Philos. Trans. R. Soc.London Ser. B 354, 463-469 (1999).
[55] P. B. Wignall, A. Hallam, Anoxia as a cause of the Permian/Triassic extinction: facies evidence from northernItaly and the western United States. Palaeogeography, Palaeoclimatology, Palaeoecology 93, 21-46 (1992).
[56] P. B. Wignall, A. Hallam, Griesbachian (Earliest Triassic) Palaeoenvironmental changes in the Salt Range,Pakistan and southeast China and their bearing on the Permo-Triassic mass extinction. Palaeogeography,Palaeoclimatology, Palaeoecology 102, 215-237 (1993).
[57] P. B. Wignall, R. J. Twitchett, Oceanic anoxia and the end Permian mass extinction. Science 272, 1155-1158(1996).
[58] H. Yin, J. Tong, Multidisciplinary high-resolution of the Permian-Triassic boundary. Palaeogeography,Palaeoclimatology, Palaeoecology 143, 199-211 (1998).
[59] D. H. Erwin et al., End-Permian mass extinction: A review. In C. Koeberl and K. G. MacLeod eds., CatastrophicEvents and Mass Extinctions: Impacts and Beyond. Geological Society of America Special Paper 356, 363-383(2002)
[60] P. R. Renne, A. R. Basu, Rapid eruption of the Siberia Traps flood basalts at the Permo-Triassic boundary.Science 253, 176-179 (1991).
[61] I. H. Campbell et al., Synchronism of the Siberia Traps and the Permian-Triassic boundary. Science 258,1760-1763 (1992).
[62] L. W. Alvarez et al., Extraterrestrial cause for the Cretaceous? Tertiary extinction. Science 208, 1095-1108(1980).
[63] G. Keller et al., Chicxulub impact predates the K-T boundary mass extinction. Proc. Natl. Acad. Sci. U.S.A. 101,3753-3758 (2004).
[64] S. A. Stewart, P. J. Allen, A 20-km-diameter multi-ringed impact structure in the North Sea. Nature 418, 520-523(2002).
[65] Y. G. Jin, Two phases of the end-Permian extinction. Palaeoworld 1, 39 (1991).
38
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
[66] Y. G. Jin, The pre-Lopingian benthose crisis. Compte Rendu, the 12th ICC-P 2, 269-278 (1993).
[67] S. M. Stanley, X. Yang, A double mass extinction at the end of the Paleozoic era. Science 266, 1340-1344 (1994).
[68] S. Shen et al., Calibrating the End-Permian mass extinction. Science 334, 1367-1372 (2011).
[69] Y. G. Jin et al., Pattern of marine mass extinction near the Permian-Triassic boundary in south China. Science289, 432-436 (2000).
[70] S. Xie et al., Two episodes of microbial change coupled with Permo/Triassic faunal mass extinction. Nature 434,494-497 (2005).
[71] Z. Chen, M. J. Benton, The timing and pattern of biotic recovery following the end-Permian mass extinction.Nature Geoscience 5, 375-383 (2012).
[72] S. A. Bowring et al., The tempo of mass extinction and recovery: The end Permian example. Proc. Natl. Acad.Sci. U.S.A. 96, 8827-8828 (1999).
[73] Y. Eshet et al., Fungal event and palynological record of the ecological crisis and recovery across thePermian-Triassic boundary. Geology 23, 967-970 (1995).
[74] M. R. Rampino et al., Tempo of the end-Permian event, High-resolution cyclostratigraphy at thePermian-Triassic boundary. Geology 28, 643-646 (2000).
[75] P. D. Ward et al., Altered river morphology in South Africa related to the Permian-Triassic extinction. Science289, 1740-1743 (2000).
[76] R. J. Twichett, Incompleteness of the Permian-Triassic fossil record, a consequence of productivity decline?Geological Journal 36, 341-353 (2001).
[77] B. U. Haq et al., Chronology of fluctuating sea levels since the triassic. Science 235, 1156-1167 (1987).
[78] B. U. Haq, S. R. Schutter, A Chronology of Paleozoic sea-level changes. Science 322, 64-68 (2008).
[79] A. Hallam, Phanerozoic Sea Level Changes (Columbia Univ. Press, New York, 1992).
[80] A. J. Boucot, J. Gray, A critique of Phanerozoic climatic models involving changes in the CO2 content of theatmosphere. Earth-Science Reviews 56, 1-159 (2001).
[81] A. J. Boucot et al., Reconstruction of the Phanerozoic Global Paleoclimate (Science Press, Beijing, 2009).
[82] R. A. Berner, The carbon cycle and CO2 over Phanerozoic time: the role of land plants. Phil. Trans. R. Soc.Lond. B. 353, 75-82 (1998).
[83] R. A. Berner, Z. Kothavala, GEOCARB III: A revised model of atmospheric CO2 over phanerozoic time.American Journal of Science 301, 182-204 (2001).
[84] R. A. Berner et al., Phanerozoic atmospheric oxygen. Annu. Rev. Earth Planet Sci. 31, 105-134 (2003).
[85] M. E. Raymo, Geochemical evidence supporting T. C. Chamberlin’s theory of glaciation. Geology 19, 344-347(1991).
[86] J. Veizer et al., 87S r/86S r, δ13C and δ18O evolution of Phanerozoic seawater. Chemical Geology 161, 59-88(1999).
[87] J. Veizer et al., Evidence for decoupling of atmospheric CO2 and global climate during the Phanerozoic eon.Nature 408, 698-701 (2000).
39
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
[88] H. M. Stoller, D. P. Schrag, Evidence for glacial control of rapid sea level changes in the Early Cretaceous.Science 272, 1771-1774 (1996).
[89] R. Riding, L. Liang, Geobiology of microbial carbonates: metazoan and seawater saturation state influenceson secular trends during the Phanerozoic. Palaeogeography, Palaeoclimatology, Palaeoecology 219, 101-115(2005).
[90] R. Riding, Phanerozoic reefal microbial carbonate abundance: comparisons with metazoan diversity, massextinction events, and seawater saturation state. Revista Espanola de Micropaleontologia 37, 23-38 (2005).
[91] R. Riding, Microbial carbonate abundance compared with fluctuations in metazoan diversity over geologicaltime. Sedimentary Geology 185, 229-238 (2006).
[92] W. Kiessling et al., Paleoreef maps: evaluation of a comprehensive database on Phanerozoic reefs. AAPGBulletin 83, 1552-1587 (1999).
[93] M. J. Benton, D. A. T. Harper, Basic Palaeontology (Addison Wesley Longman, Essex, England, 1997).
[94] J. J. Sepkoski, Jr., Rates of speciation in the fossil record. Philosophical Transactions of the Royal Society ofLondon, Series B 333, 315-326 (1998).
[95] K. W. Flessa, D. Jablonski, Declining Phanerozoic background extinction rates: effect of taxonomic structure.Nature 313, 216-218 (1985).
[96] L. Van Valen, How constant is extinction? Evol. Theory 7, 93-106 (1985).
[97] J. J. Sepkoski, Jr., A model of onshore-offshore change in faunal diversity. Paleobiology 17, 58-77 (1991).
[98] N. L. Gilinsky, Volatility and the Phanerozoic decline of background extinction intensity. Paleobiology 20,445-458 (1994).
[99] J. Alroy, Equilibrial diversity dynamics in north American mammals, in M. L. McKinney, J. A. Drake ed.,Biodiversity Dynamics: Turnover of Populations, Taxa, and Communities (Columbia Univ Press, New York,1998).
[100] D. M. Raup, J. J. Sepkoski, Jr., Mass extinctions in the marine fossil record. Science 215, 1501-1503 (1982).
[101] M. E. J. Newman, G. J. Eble, Decline in extinction rates and scale invariance in the fossil record. Paleobiology25, 434-439 (1999).
[102] D. Jablonski, Species selection: Theory and data. Annu. Rev. Ecol. Evol. Syst. 39, 501-524 (2008).
E-mail: [email protected]
Acknowledgements My warm thanks to Jinyi Li for valuable discussions. Supported by the FundamentalResearch Funds for the Central Universities.
40
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
Index
Section 1roadmap for the evolution of the genetic code
(the roadmap for short)initiation stage of the roadmap
expansion stage of the roadmap
ending stage of the roadmap
origin of the homochirality of lifeorigin of the genetic codeorigin of the three domains of lifebase pair of double helix DNAbase triplex of triplex DNArelative stability of base triplexbase substitutions in the roadmap
codon pair in the roadmap
pair connection in the roadmap
route duality in the roadmap
Route 0 − 3 of the roadmap
Hierarchy 1 − 4 of the roadmap
initial subset in the roadmap
recruitment order of codon pairs (#1 to #32)recruitment order of amino acids (No. 1 to No. 20)cooperative recruitment of codons and amino acidsnon-standard codons in the roadmap
origin of stop codonsorigin of tRNAsbranch node in the roadmap
leaf node in the roadmap
palindromic codon pairnon-palindromic codon paircomplementary single RNA strandsassignment scheme of anticodonspermutations of basescombinations of basesexplanation of number of canonical amino acidsexplanation of the wobble paring ruleexplanation of codon degeneracyminimum subset in the roadmap
facet of cube for routebiosynthetic families of amino acidsstandard genetic code tableGCAU genetic code tableclassification of aminoacyl-tRNA synthetaseschiral roadmapwinner-take-all competition between chiral roadmapssimulation of the origin of homochiralityexplanation of position specific GC contentexplanation of stop codon usageexplanation of variation of amino acid frequenciesorder R10/10 for amino acidsPhase I and II of amino acids
Section 2genomic codon distributioncodon intervals in genomecorrelation coefficient matrixdistance matrix by genomic codon distributions
tree of codons by genomic codon distributions
tree of species by genomic codon distributions
tree of eukaryotes by genomic codon distributions
rough invariance of genomic codon distributionrough periodicity of genomic codon distributiondeclining three-base-periodic fluctuationsamplitudes of genomic codon distributionorder of the average distribution heightsorigin order of domains by slope of regression lineevolutionary status of Virussimulation of genomic codon distributionsexplanation of validity of composition vector treeevolution spacefluctuation (coordination of the evolution space)route bias (coordination of the evolution space)hierarchy bias (coordination of the evolution space)leaf node bias
41
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
classification of genomic codon distributionsnon-dimensionalized evolution spacedistance matrix by the evolution space
tree of species by the evolution space
tree of taxa by the evolution space
tree of eukaryotes by the evolution space
three-domain tree of lifeeocyte tree of life
Section 3split and reconstruction schemeevolution of life at sequence levelevolution of life at species levelgenome size evolutionPhanerozoic biodiversity curvePhanerozoic climatic curvePhanerozoic eustatic curveexponential growth trend of the biodiversity curveexponential growth trend in genome size evolutionbiodiversity net fluctuation curveclimate gradient curveclimto-eustatic curvelogarithmic normal distribution of genome sizeslogarithmic mean of genome sizeslogarithmic standard deviation of genome sizesoriginal logarithmic mean of genome sizesfactor χsimulation of genome size evolution
interactions among earth’s spheresfive mass extinctionscontrol factors of mass extinctionsmultifactorial explanation of mass extinctionsHaq sea level curveHallam sea level curveclimate indicator curveBerner’s CO2 curvemarine 87S r/86S r curvemarine δ18O curveaverage after nondimensionalizationderivative curvesaggregation and dispersal of Pangaeaeustatic change at tectonic time scaleclimatic change at tectonic time scaleinstant response of biodiversity at tectonic time scaletectonic cause of mass extinctionsoutline of the biodiversity fluctuationstwo phases of P-Tr mass extinctionminor extinctionsradiations and rebounds of biodiversityreconstructed Phanerozoic biodiversity curveexplanation of the declining trend of origination rateexplanation of the declining trend of extinction ratesynchronicity between origination and extinctionconstraint on origination abilityredundant species created at the sequence levelstrategy of adaptation
42
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
Supplementary Figures
Fig 1A | Fig S1a1Fig 1B |
Fig 2A | Fig S2a1, S2a2, S2a3Fig 2B | Fig S2b1, S2b2, S2b3, S2b4, S2b5
Fig 2C | Fig S2c1, S2c2, S2c3Fig 2D | Fig S2d1
Fig 3A | Fig S3a1, S3a2, S3a3, S3a4, S3a5, S3a6Fig 3B | Fig S3b1, S3b2, S3b3
Fig 3C | Fig S3c1, S3c2Fig 3D | Fig S3d1, S3d2, S3d3, S3d4, S3d5, S3d6, S3d7, S3d8, S3d9, S3d10
Fig 4A | Fig S4a1, S4a2, S4a3, S4a4, S4a5, S4a6, S4a7, S4a8Fig 4B | Fig S4b1, S4b2, S4b3, S4b4, S4b5, S4b6, S4b7
Fig 4C | Fig S4c1Fig 4D | Fig S4d1, S4d2, S4d3
43
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
#11G → A
Y1 R1 * R11 Y11CCC
(stop,Ser)10Arg 6Pro
GGG
(stop,Ser)10Arg 6Pro
GGA
(stop,Ser)10Arg 6Pro
CCT
(stop,Ser)10Arg 6Pro
+
+
++
#14G → A
Y11R11*y11→ y11r11*R14 Y14TCC
(stop,Ser)10Arg 7Ser
AGG
(stop,Ser)10Arg 7Ser
TCC
(stop,Ser)10Arg 7Ser
CCT
(stop,Ser)10Arg 7Ser
GGA
(stop,Ser)10Arg 7Ser
AGA
(stop,Ser)10Arg 7Ser
TCT
(stop,Ser)10Arg 7Ser
4+
4+
4+
++
+
+
#1 initial subset
Y R * R1 Y1CCC
1Gly 6Pro
GGG
1Gly 6Pro
GGG
1Gly 6Pro
CCC
1Gly 6Pro
+
+
+
#3 G → A initial subset
Y1 R1 * R3 Y3CCC
1Gly 7Ser
GGG
1Gly 7Ser
AGG
1Gly 7Ser
TCC
1Gly 7Ser
++
+
+
#27G → A
Y4 R4 * R27 Y27CTC
20Lys (Thr)8Leu
GAG
20Lys (Thr)8Leu
GAA
20Lys (Thr)8Leu
CTT
20Lys (Thr)8Leu
+
+
++
#32G → A
Y23 R23 * R32 Y32CTT
(Asn)20Lys 17Phe
GAA
(Asn)20Lys 17Phe
AAA
(Asn)20Lys 17Phe
TTT
(Asn)20Lys 17Phe
++
+
+
#4 G → A initial subset
Y1 R1 * R4 Y4CCC
3Glu (Thr)8Leu
GGG
3Glu (Thr)8Leu
GAG
3Glu (Thr)8Leu
CTC
3Glu (Thr)8Leu
+
++
+
#23G → A
Y4 R4 * R23 Y23CTC
3Glu 17Phe
GAG
3Glu 17Phe
AAG
3Glu 17Phe
TTC
3Glu 17Phe
++
+
+
#5 G → A initial subset
Y2 R2 *y2 → y2 r2 *R5 Y5 CCG
4Asp 5Val
GGC
4Asp 5Val
CCG
4Asp 5Val
GCC
4Asp 5Val
CGG
4Asp 5Val
CAG
4Asp 5Val
GTC
4Asp 5Val
4+
4+
+
+
++
+
#26G → A
Y5 R5 *y5 → y5 r5 *R26 Y26CTG
19Asn 5Val
GAC
19Asn 5Val
CTG
19Asn 5Val
GTC
19Asn 5Val
CAG
19Asn 5Val
CAA
19Asn 5Val
GTT
19Asn 5Val
4+
4+
+
+
+
++
#2 G → C initial subset
Y1 R1 * R2 Y2CCC
1Gly 2Ala
GGG
1Gly 2Ala
CGG
1Gly 2Ala
GCC
1Gly 2Ala
4+
+
+
#8 G → A
Y2 R2 *y2 → y2 r2 *R8 Y8 CCG
7Ser 2Ala
GGC
7Ser 2Ala
CCG
7Ser 2Ala
GCC
7Ser 2Ala
CGG
7Ser 2Ala
CGA
7Ser 2Ala
GCT
7Ser 2Ala
4+
4+
+
+
+
++
#21G → A
Y6 R6 *y6 → y6 r6 *R21 Y21CCA
4Asp 15Ile
GGT
4Asp 15Ile
CCA
4Asp 15Ile
ACC
4Asp 15Ile
TGG
4Asp 15Ile
TAG
4Asp 15Ile
ATC
4Asp 15Ile
4+
4+
−
+
++
+
#30G → A
Y21R21*y21→ y21r21*R30 Y30CTA
19Asn 15Ile
GAT
19Asn 15Ile
CTA
19Asn 15Ile
ATC
19Asn 15Ile
TAG
19Asn 15Ile
TAA
19Asn 15Ile
ATT
19Asn 15Ile
4+
4+
−
+
+
++
#6 C → T initial subset
Y2 R2 *y2 → y2 r2 *R6 Y6 CCG
1Gly 9Thr
GGC
1Gly 9Thr
CCG
1Gly 9Thr
GCC
1Gly 9Thr
CGG
1Gly 9Thr
TGG
1Gly 9Thr
ACC
1Gly 9Thr
4+
4+
+
++
+
+
#17G → A
Y6 R6 *y6 → y6 r6 *R17 Y17CCA
7Ser 9Thr
GGT
7Ser 9Thr
CCA
7Ser 9Thr
ACC
7Ser 9Thr
TGG
7Ser 9Thr
TGA
7Ser 9Thr
ACT
7Ser 9Thr
4+
4+
−
+
+
++
#16G → A
Y7 R7 * R16 Y16CGC
9Thr 10Arg
GCG
9Thr 10Arg
GCA
9Thr 10Arg
CGT
9Thr 10Arg
+
+
++
#18G → A
Y16R16*y16→ y16r16*R18 Y18TGC
9Thr 11Cys
ACG
9Thr 11Cys
TGC
9Thr 11Cys
CGT
9Thr 11Cys
GCA
9Thr 11Cys
ACA
9Thr 11Cys
TGT
9Thr 11Cys
4+
+
4+
++
+
+
#7 G → C
Y1 R1 * R7 Y7CCC
2Ala 10Arg
GGG
2Ala 10Arg
GCG
2Ala 10Arg
CGC
2Ala 10Arg
+
4+
+
#9 G → A
Y7 R7 * R9 Y9CGC
2Ala 11Cys
GCG
2Ala 11Cys
ACG
2Ala 11Cys
TGC
2Ala 11Cys
++
+
+
#22G → A
Y19 R19 * R22 Y22CAC
16Met 13His
GTG
16Met 13His
GTA
16Met 13His
CAT
16Met 13His
+
+
++
#29G → A
Y24R24*y24→ y24r24*R29 Y29CAT
(Met,stop)15Ile 18Tyr
GTA
(Met,stop)15Ile 18Tyr
CAT
(Met,stop)15Ile 18Tyr
TAC
(Met,stop)15Ile 18Tyr
ATG
(Met,stop)15Ile 18Tyr
ATA
(Met,stop)15Ile 18Tyr
TAT
(Met,stop)15Ile 18Tyr
4+
−
4+
+
+
++
#19C → T
Y7 R7 * R19 Y19CGC
5Val 13His
GCG
5Val 13His
GTG
5Val 13His
CAC
5Val 13His
+
++
+
#24G → A
Y19 R19 * R24 Y24CAC
5Val 18Tyr
GTG
5Val 18Tyr
ATG
5Val 18Tyr
TAC
5Val 18Tyr
++
+
+
#20G → A
Y10R10*y10→ y10r10*R20 Y20GCC
14Gln (Thr,Ser)8Leu
CGG
14Gln (Thr,Ser)8Leu
GCC
14Gln (Thr,Ser)8Leu
CCG
14Gln (Thr,Ser)8Leu
GGC
14Gln (Thr,Ser)8Leu
GAC
14Gln (Thr,Ser)8Leu
CTG
14Gln (Thr,Ser)8Leu
+
4+
4+
+
++
+
#28G → A
Y20R20*y20→ y20r20*R28 Y28GTC
14Gln 8Leu
CAG
14Gln 8Leu
GTC
14Gln 8Leu
CTG
14Gln 8Leu
GAC
14Gln 8Leu
AAC
14Gln 8Leu
TTG
14Gln 8Leu
+
4+
4+
++
+
+
#10G → C
Y1 R1 * R10 Y10CCC
(stop)10Arg 6Pro
GGG
(stop)10Arg 6Pro
GGC
(stop)10Arg 6Pro
CCG
(stop)10Arg 6Pro
+
+
4+
#13G → A
Y10R10*y10→ y10r10*R13 Y13GCC
10Arg 7Ser
CGG
10Arg 7Ser
GCC
10Arg 7Ser
CCG
10Arg 7Ser
GGC
10Arg 7Ser
AGC
10Arg 7Ser
TCG
10Arg 7Ser
+
4+
4+
++
+
+
#25G → A
Y12R12*y12→ y12r12*R25 Y25ACC
(Gln)stop (Thr)8Leu
TGG
(Gln)stop (Thr)8Leu
ACC
(Gln)stop (Thr)8Leu
CCA
(Gln)stop (Thr)8Leu
GGT
(Gln)stop (Thr)8Leu
GAT
(Gln)stop (Thr)8Leu
CTA
(Gln)stop (Thr)8Leu
−
4+
4+
+
++
+
#31G → A
Y25R25*y25→ y25r25*R31 Y31ATC
(Gln)stop 8Leu
TAG
(Gln)stop 8Leu
ATC
(Gln)stop 8Leu
CTA
(Gln)stop 8Leu
GAT
(Gln)stop 8Leu
AAT
(Gln)stop 8Leu
TTA
(Gln)stop 8Leu
−
4+
4+
++
+
+
#12C → T
Y10R10*y10→ y10r10*R12 Y12GCC
12Trp 6Pro
CGG
12Trp 6Pro
GCC
12Trp 6Pro
CCG
12Trp 6Pro
GGC
12Trp 6Pro
GGT
12Trp 6Pro
CCA
12Trp 6Pro
+
4+
4+
+
+
++
#15G → A
Y12R12*y12→ y12r12*R15 Y15ACC
(Trp,Cys,Sel)stop 7Ser
TGG
(Trp,Cys,Sel)stop 7Ser
ACC
(Trp,Cys,Sel)stop 7Ser
CCA
(Trp,Cys,Sel)stop 7Ser
GGT
(Trp,Cys,Sel)stop 7Ser
AGT
(Trp,Cys,Sel)stop 7Ser
TCA
(Trp,Cys,Sel)stop 7Ser
−
4+
4+
++
+
+
Pro
Gly
Arg(Ser)
Glu
Leu(T
hr)
Ala
Gly
Val
Thr
Arg
Ala
Thr
Val
His
Arg
Pro
Gln,Leu
(Trp)
Phe
Tyr
Ile
Leu(Gln)
Gly
Pro
Hierarchy 1 Hierarchy 2 Hierarchy 3 Hierarchy 4
Ro
ute
0
Ro
ute
1
Ro
ute
2
Ro
ute
3
Ser
Asp
Ser
Cys
Leu(G
ln,T
hr)
Ser
Lys
Asn
(Met)
Leu(G
ln)
Fig 1A The roadmap for the evolution of the genetic code. The 64 codons formed from base substitutions in triplexDNAs are in red. Only three-base-length segments of the triplex DNAs are shown explicitly; the whole lengthright-handed triplex DNAs are indicated in Fig 1B. In each position #n, the #n codon pair on Rn and Yn is in red.The relative stabilities of the base triplexes (-, +, ++, 4+) are written to the right of the base triplexes, where theincreasing relative stabilities of base triplexes in base substitutions are indicated in green. Each triplex DNA is denotedby three arrows, whose directions are from 5’ to 3’. The YR ∗ R triplex DNAs are in pink, and the YR ∗ Y triplex DNAsin azure. The recruitment order of codon pairs are from #1 to #32, and the recruitment order of the 20 amino acids areto the left of them respectively. Non-standard genetic codes are indicated by brackets beside the corresponding aminoacids. The Route 0 − 3 and Hierarchy 1 − 4 are indicated to the right of and below the roadmap respectively. Theevolution of the genetic code are denoted by black arrows, beside which pair connections are indicated by certain aminoacids. Refer to an example in Fig 1B to understand details of the roadmap; refer to Fig S1a1 to understand the criticalrole of relative stabilities of base triplexes in achieving the real genetic code; refer to Fig 2A to see the origin of tRNAs;refer to Fig 2B to see the coherent relationship between the recruitment orders of codons and amino acids; refer to Fig2C to see the codon degeneracy in the symmetric roadmap; and refer to Fig 2D to see the origin of homochirality of life.
44
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
#11CG*G → CG*A
CCC
GGG
GGA
+
+
++
#14CG*G → CG*A
CCT
GGA
AGA
+++
+
#1
CCC
GGG
GGG
+
+
+
#3 CG*G → CG*A
CCC
GGG
AGG
+++
+
#27CG*G → CG*A
CTC
GAG
GAA
+
+
++
#32CG*G → CG*A
CTT
GAA
AAA
+++
+
#4 CG*G → CG*A
CCC
GGG
GAG
+
+++
#23CG*G → CG*A
CTC
GAG
AAG
+++
+
#5 CG*G → CG*A
GCC
CGG
CAG
+
+++
#26CG*G → CG*A
GTC
CAG
CAA
+
+
++
#2 CG*G → CG*C
CCC
GGG
CGG
4++
+
#8 CG*G → CG*A
GCC
CGG
CGA
+
+
++
#21CG*G → CG*A
ACC
TGG
TAG
+
+++
#30CG*G → CG*A
ATC
TAG
TAA
+
+
++
#6 GC*C → GC*T
GCC
CGG
TGG
+++
+
#17CG*G → CG*A
ACC
TGG
TGA
+
+
++
#16CG*G → CG*A
CGC
GCG
GCA
+
+
++
#18CG*G → CG*A
CGT
GCA
ACA
+++
+
#7 CG*G → CG*C
CCC
GGG
GCG
+
4++
#9 CG*G → CG*A
CGC
GCG
ACG
+++
+
#22CG*G → CG*A
CAC
GTG
GTA
+
+
++
#29CG*G → CG*A
TAC
ATG
ATA
+
+
++
#19GC*C → GC*T
CGC
GCG
GTG
+
+++
#24CG*G → CG*A
CAC
GTG
ATG
+++
+
#20CG*G → CG*A
CCG
GGC
GAC
+
+++
#28CG*G → CG*A
CTG
GAC
AAC
+++
+
#10CG*G → CG*C
CCC
GGG
GGC
+
+
4+
#13CG*G → CG*A
CCG
GGC
AGC
+++
+
#25CG*G → CG*A
CCA
GGT
GAT
+
+++
#31CG*G → CG*A
CTA
GAT
AAT
+++
+
#12GC*C → GC*T
CCG
GGC
GGT
+
+
++
#15CG*G → CG*A
CCA
GGT
AGT
+++
+
Hierarchy 1 Hierarchy 2 Hierarchy 3 Hierarchy 4
Ro
ute
0
Ro
ute
1
Ro
ute
2
Ro
ute
3
CCC
GGG
GGG
+
+
+
CCT
GGA
GGA
+ +
+
CCC
GGG
GGG
+
+
+
CCC
GGG
GGG
+ +
+
CTC
GAG
GAG
+
+
+
CTT
GAA
GAA
+ +
+
CCC
GGG
GGG
+
+ +
CTC
GAG
GAG
+ +
+
GCC
CGG
CGG
+
+ +
GTC
CAG
CAG
+
+
+
CCC
GGG
GGG
+ +
+
GCC
CGG
CGG
+
+
+
ACC
TGG
TGG
+
+ +
ATC
TAG
TAG
+
+
+
GCC
CGG
CGG
+ +
+
ACC
TGG
TGG
+
+
+
CGC
GCG
GCG
+
+
+
CGT
GCA
GCA
+ +
+
CCC
GGG
GGG
+
+ +
CGC
GCG
GCG
+ +
+
CAC
GTG
GTG
+
+
+
TAC
ATG
ATG
+
+
+
CGC
GCG
GCG
+
+ +
CAC
GTG
GTG
+ +
+
CCG
GGC
GGC
+
+ +
CTG
GAC
GAC
+ +
+
CCC
GGG
GGG
+
+
+
CCG
GGC
GGC
+ +
+
CCA
GGT
GGT
+
+ +
CTA
GAT
GAT
+ +
+
CCG
GGC
GGC
+
+
+
CCA
GGT
GGT
+ +
+
Hierarchy 1 Hierarchy 2 Hierarchy 3 Hierarchy 4
Ro
ute
0
Ro
ute
1
Ro
ute
2
Ro
ute
3
Fig S1a1 Explanation of the increasing relative stabilities of base triplexes in base substitutions along theroadmap Fig 1A. The base substitutions in the roadmap occur when the relative stabilities of base triplexesincrease. The roadmap is the best result to avoid the unstable base triplexes. So, the universal genetic codeis a narrow choice by the relative stabilities of base triplexes. The relative stability increases from (+) of thebase triplex CG ∗G to (4+) of the base triplex CG ∗C at #2, #7 and #10 that initiates Route 1− 3 respectively.GC ∗ C (+) changes to GC ∗ T (++) at #6, #19 and #12, and CG ∗ G (+) changes to CG ∗ A (++) at otherpositions in the roadmap.
45
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
#1 #7 #19 #24 #29G → C C → T G → A G → A
step by step
(The stability of basetriplex increases from
(The stability of basetriplex increases from
(The stability of basetriplex increases from
(The stability of basetriplex increases from
CG*G + to CG*C 4+ ) GC*C + to GC*T ++ ) CG*G + to CG*A ++ ) CG*G + to GC*A ++ )
R
GGG
combine
combine
seperate
re−combine
Y R
CCC
GGG
Y1 R1
CCC
GGG
Y24R24
CAT
GTA
y24 r24
TAC
ATG
Y7 R7
CGC
GCG
Y19R19
CAC
GTG
Y R *R1
CCC
GGG
GGG
+ + +
Y1 R1 *R7
CCC
GGG
GCG
+ 4++
Y24R24*y24
CAT
GTA
CAT
4+− 4+
y24 r24 *R29
TAC
ATG
ATA
+ + ++
Y7 R7 *R19
CGC
GCG
GTG
+ +++
Y19R19*R24
CAC
GTG
ATG
+++ +
Y R R1
andCCC
GGG
GGG
Y1 R1 R7
andCCC
GGG
GCG
Y24 R24 y24
andCAT
GTA
CAT
y24 r24 R29
andTAC
ATG
ATA
Y7 R7 R19
andCGC
GCG
GTG
Y19 R19 R24
andCAC
GTG
ATG
R1 Y1
GGG
CCC
R7 Y7
GCG
CGC
y24r24
CAT
GTA
R29 Y29
ATA
TAT
R19 Y19
GTG
CAC
R24 Y24
ATG
TAC
R
5’
3’
G
G
G
G
G
G
G
G
G
G
G
G
G
Y
3’
5’
C
C
C
C
C
C
C
C
CCC
G GG
G
G
G5’
3’ 5’
3’5’3’
Y R R1
CCC
G GG
5’
3’
3’5’Y R
CCC
G GG
G
G
C5’
3’ 5’
3’5’3’
Y1 R1 R7
CCC
G GG
5’
3’
3’5’Y1 R1
CGC
G CG
G
G
T5’
3’ 5’
3’5’3’
Y7 R7 R19
CGC
G CG
5’
3’
3’5’Y7 R7
CAC
G TG
G
A
T5’
3’ 5’
3’5’3’
Y19R19R24
CAC
G TG
5’
3’
3’5’Y19R19
TAC
A TG
T
C
A5’
3’ 3’
3’5’5’
Y24R24y24
TAC
A TG
5’
3’
3’5’Y24R24
CAT
G TA
A
A
T5’
3’ 5’
3’5’3’
y24r24R29
CAT
G TA
5’
3’
3’5’y24r24
Fig 1B A detailed description of the roadmap in the right-handed triplex DNA picture, taking for examplefrom #1 to #29. The evolution of the genetic code from #1, to #7, to #19, to #24, and at last to #29 areexplained in detail in the upper boxes, and the corresponding right-handed single-stranded, double-strandedand triple-stranded DNAs are shown in the lower boxes, respectively.
46
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
#11
tRNA t7 tRNA t10
Y11R11*Yt11 R
t11 y11r11*y
t11 r
t11
7Ser 10Arg
TCC
AGG
UCC
AGG
CCT
GGA
CCU
GGA
#14
#1
tRNA t1
Y1 R1 * Yt1 R
t1
1Gly
CCC
GGG
CCC
GGG
#3 #27
tRNA t20 tRNA t17
Y23R23*Yt23 R
t23 y23r23*y
t23 r
t23
20Lys 17Phe
CTT
GAA
CUU
GAA
TTC
AAG
UUC
AAG
#32
#4
tRNA t3
Y4 R4 * Yt4 R
t4
3Glu
CTC
GAG
CUC
GAG
#23
#5
tRNA t4
Y5 R5 *Yt5 R
t5 y5 r5 *y
t5 r
t5
4Asp
CTG
GAC
CUG
GAC
GTC
CAG
GUC
CAG
#26
#2
Y2 R2 *Yt2 R
t2 y2 r2 *y
t2 r
t2
CCG
GGC
CCG
GGC
GCC
CGG
GCC
CGG
#8 #21
tRNA t19
Y21R21*Yt21 R
t21 y21r21*Y
t30 R
t30
19Asn
CTA
GAT
CUA
GAU
ATC
TAG
AUU
UAA
#30
#6
Y6 R6 *Yt6 R
t6 y6 r6 *y
t6 r
t6
CCA
GGT
CCA
GGU
ACC
TGG
ACC
UGG
#17
#16
tRNA t11 tRNA t9
Y16R16*Yt16 R
t16 y16r16*y
t16 r
t16
11Cys 9Thr
TGC
ACG
UGC
ACG
CGT
GCA
CGU
GCA
#18
#7
tRNA t2
Y7 R7 * Yt7 R
t7
2Ala
CGC
GCG
CGC
GCG
#9 #22
tRNA t16 tRNA t18
Y24R24*Yt24 R
t24 y24r24*y
t24 r
t24
16Met 18Tyr
CAT
GTA
CAU
GUA
TAC
ATG
UAC
AUG
#29
#19
tRNA t13
Y19 R19 * Yt19 R
t19
13His
CAC
GTG
CAC
GUG
#24
#20
tRNA t5 tRNA t14
Y20R20*Yt20 R
t20 y20r20*y
t20 r
t20
5Val 14Gln
GTC
CAG
GUC
CAG
CTG
GAC
CUG
GAC
#28
#10
tRNA t6
Y10R10*Yt10 R
t10 y10r10*y
t10 r
t10
6Pro
GCC
CGG
GCC
CGG
CCG
GGC
CCG
GGC
#13 #25
tRNA t15 tRNA t8
Y25R25*Yt25 R
t25 y25r25*y
t25 r
t25
15Ile 8Leu
ATC
TAG
AUC
UAG
CTA
GAT
CUA
GAU
#31
#12
tRNA t12
Y12R12*Yt12 R
t12 y12r12*y
t12 r
t12
12Trp
ACC
TGG
ACC
UGG
CCA
GGT
CCA
GGU
#15
Wobble pairing G:U,C
Wobble pairing U:G,A
Hierarchy 1 Hierarchy 2 Hierarchy 3 Hierarchy 4
Ro
ute
0
Ro
ute
1
Ro
ute
2
Ro
ute
3
Fig 2A The tRNAs originated along the roadmap are able to transport all the canonical amino acids. Eachtriplex nucleic acid YR ∗ Yt or YR ∗ Rt consists of two DNA strands Y and R and one RNA strand Yt or Rt.There are two possible ways for each codon pair to generate two complementary RNA strands (Yt and Rt),which consequently assemble into a tRNA (Fig S2a3).
47
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
CCC
CUC UCC CCU
CGC GCC CCG
CAC ACC CCA
GUC CUG
UGC CGU
AUC(AUU) CUA
UAC CAU
CUU UUC
GGG
GAG GGA AGG
GCG GGC CGG
GUG GGU UGG
GAC CAG
GCA ACG
GAU UAG
GUA AUG
AAG GAA
t1
t3 t10
t2
t12
t4 t14
t9
t19
t16
t20
t7
t6
t13
t5
t11
t15 t8
t18
t17
Palindromic anti−codon Non−palindromic anti−codon
Y s
tran
dR
stra
nd
Stra
nd
po
sitio
nFig S2a1 Single-strand RNA pairs at the branch nodes in the roadmap Fig 1A can form tRNAs: the tRNAs t1,t10, t3, t17, t2, t9, t13, t18 correspond to the codon pairs in Route 0 and Route 2, and the tRNAs t7, t20, t11,t16 correspond to the reverse codon pairs (considering the reverse operation) in Route 0 and Route 2 (Fig 2A).The codon pairs are reverse to each other between Route 1 and Route 3. The tRNAs t4, t19, t6, t5, t14, t12,t15, t8 correspond to codon pairs or reverse codon pairs in Route 1 and Route 3 (Fig 2A). In total, there are20 tRNAs such obtained that correspond to the codon pairs or their reverse codon pairs at the branch nodes.The anti-codons concerned are arranged in the present figure by comparing between Y strand anti-codonsand R strand anti-codons, as well as by comparing between palindromic anti-codons and non-palindromicanti-codons, and the rows in the present figure indicate the types of genomic codon distributions in Fig S3d1.
48
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
GGG 1Gly t1
GGA 1Gly
GGC 1Gly
GGU 1Gly
GCG 2Alat2
GCA 2Ala
GCC 2Ala
GCU 2Ala
GAG 3Glu t3
GAA 3Glu
GAC 4Asp t4
GAU 4Asp
GUG 5Val
GUA 5Val
GUC 5Valt5
GUU 5Val
CGG10Arg
CGA10Arg
CGC10Arg
CGU10Arg
CCG 6Prot6
CCA 6Pro
CCC 6Pro
CCU 6Pro
CAG14Gln t14
CAA14Gln
CAC13His t13
CAU13His
CUG 8Leu
CUA 8Leut8
CUC 8Leu
CUU 8Leu
AGG10Arg t10
AGA10Arg
AGC 7Ser
AGU 7Ser
ACG 9Thrt9
ACA 9Thr
ACC 9Thr
ACU 9Thr
AAG20Lys t20
AAA20Lys
AAC19Asn
AAU19Asnt19
AUG 16Mett16
AUA 15Ile
AUC 15Ilet15
AUU 15Ile
UGG12Trp t12
UGA stop
UGC11Cys t11
UGU11Cys
UCG 7Ser
UCA 7Ser
UCC 7Sert7
UCU 7Ser
UAG stop
UAA stop
UAC18Tyr t18
UAU18Tyr
UUG 8Leu
UUA 8Leu
UUC 17Phet17
UUU 17Phe
Fig S2a2 The 20 tRNAs t1-t20 (Fig 2A, S2a1) are able to carry all the 20 canonical amino acids according tothe proper codon assignment in the present figure, where the blue and red codons denote the R- and Y-strandcodons in Fig S2a1 respectively, and the green lines indicate the codon pairs.
49
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
1Gly t1 Y
t1 R
t1 CCC GGG
5’ 3’
2Ala t2 Y
t7 R
t7 CGC GCG
5’ 3’
3Glu t3 Y
t4 R
t4 CUC GAG
5’ 3’
4Asp t4 y
t5 r
t5 GUC GAC
5’ 3’
5Val t5 R
t20 Y
t20GAC GUC
5’ 3’
6Pro t6 rt10 y
t10CGG CCG
5’ 3’
7Ser t7 R
t11 Y
t11GGA UCC
5’ 3’
8Leu t8 rt25 y
t25UAG CUA
5’ 3’
9Thr t9 y
t16 r
t16CGU ACG
5’ 3’
10Arg t10y
t11 r
t11CCU AGG
5’ 3’
11Cys t11R
t16 Y
t16GCA UGC
5’ 3’
12Trp t12y
t12 r
t12CCA UGG
5’ 3’
13His t13R
t19 Y
t19GUG CAC
5’ 3’
14Gln t14y
t20 r
t20CUG CAG
5’ 3’
15Ile t15R
t25 Y
t25GAU AUC
5’ 3’
16Met t16Y
t24 R
t24CAU AUG
5’ 3’
17Phe t17rt23 y
t23GAA UUC
5’ 3’
18Tyr t18rt24 y
t24GUA UAC
5’ 3’
19Asn t19Y
t30 R
t30AUU AAU
5’ 3’
20Lys t20Y
t23 R
t23CUU AAG
5’ 3’
anti−codon
5’
3’
3’
5’
an
ti−co
do
n
5’
3’
3’
5’
an
ti−co
do
n
5’ 3’
anti−codon
Fig S2a3 The tRNAs t1-t20 with anti-codons (Fig 2A, S2a1, S2a2) are listed here to carry the amino acidsfrom No.1 to No.20 respectively. The two complimentary single-stranded RNAs for each tRNA join together,and fold into a cloverleaf shape by taking advantage of the complementarity between the two strands. Thejoining position of the two strands is near to the 3’ side of the anti-codon loop, which agrees with the positionof introns in tRNA genes in observations. The additional codon in the complementary strand might be locatedaround the variable loop, despite possibly great changes in this loop. This may relate to the paracodon in thetRNAs.
50
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
#1110Arg
6Pro
#1410Arg
7Ser
#1 1Gly
6Pro
#3 1Gly
7Ser
#2720Lys
8Leu
#3220Lys
17Phe
#4 3Glu
8Leu
#23 3Glu
17Phe
#5 4Asp
5Val
#2619Asn
5Val
#2 1Gly
2Ala
#8 7Ser
2Ala
#21 4Asp
15Ile
#3019Asn
15Ile
#6 1Gly
9Thr
#17 7Ser
9Thr
#16 9Thr
10Arg
#18 9Thr
11Cys
#7 2Ala
10Arg
#9 2Ala
11Cys
#2216Met
13His
#2915Ile
18Tyr
#19 5Val
13His
#24 5Val
18Tyr
#2014Gln
8Leu
#2814Gln
8Leu
#1010Arg
6Pro
#1310Arg
7Ser
#25 stop
8Leu
#31 stop
8Leu
#1212Trp
6Pro
#15 stop
7Ser
Ala Pro Ser Thr Val Leu
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
Initial subset
Hierarchy 1 Hierarchy 2 Hierarchy 3 Hierarchy 4
Ro
ute
0
Ro
ute
1
Ro
ute
2
Ro
ute
3
Fig 2B Cooperative recruitment of codons and amino acids along the roadmap. The codon pairs are plottedfrom left to right according to their recruitment order. The initial subset plays a crucial role in the expansionof the genetic code. The 6 biosynthetic families of the amino acid are distinguished by different colours.
51
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
GGG GCC GAG GAC GUC CCC UCC CUC ACC CGC UGC UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#1
GGC GCC GAG GAC GUC CCG UCC CUC ACC CGC UGC UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#2
GGA GCG GAG GAC GUC CCG UCC CUC ACC CGC UGC UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#3
GGU GCG GAG GAC GUC CCG AGC CUC ACC CGC UGC UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#4
GGU GCG GAA GAC GUC CCG AGC CUG ACC CGC UGC UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#5
GGU GCG GAA GAU GUG CCG AGC CUG ACC CGC UGC UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#6
GGU GCG GAA GAU GUG CCG AGC CUG ACG CGC UGC UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#7
GGU GCU GAA GAU GUG CCG AGC CUG ACG CGG UGC UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#8
GGU GCA GAA GAU GUG CCG UCG CUG ACG CGG UGC UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#9
GGU GCA GAA GAU GUG CCG UCG CUG ACG CGG UGU UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#10
GGU GCA GAA GAU GUG CCU UCG CUG ACG AGG UGU UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#11
GGU GCA GAA GAU GUG CCA UCG CUG ACG CGA UGU UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#12
GGU GCA GAA GAU GUG CCA UCG CUG ACG CGA UGU UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#13
GGU GCA GAA GAU GUG CCA UCU CUG ACG AGA UGU UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#14
GGU GCA GAA GAU GUG CCA UCA CUG ACG CGU UGU UGG CAC CAG AUC AUG UUC UAC AAC AAG UGA#15
GGU GCA GAA GAU GUG CCA AGU CUG ACG CGU UGU UGG CAC CAG AUC AUG UUC UAC AAC AAG UAG#16
GGU GCA GAA GAU GUG CCA AGU CUG ACU CGU UGU UGG CAC CAG AUC AUG UUC UAC AAC AAG UAG#17
GGU GCA GAA GAU GUG CCA AGU CUG ACA CGU UGU UGG CAC CAG AUC AUG UUC UAC AAC AAG UAG#18
GGU GCA GAA GAU GUG CCA AGU CUG ACA CGU UGU UGG CAC CAG AUC AUG UUC UAC AAC AAG UAG#19
GGU GCA GAA GAU GUA CCA AGU CUG ACA CGU UGU UGG CAU CAG AUC AUG UUC UAC AAC AAG UAG#20
GGU GCA GAA GAU GUA CCA AGU CUA ACA CGU UGU UGG CAU CAA AUC AUG UUC UAC AAC AAG UAG#21
GGU GCA GAA GAU GUA CCA AGU CUA ACA CGU UGU UGG CAU CAA AUA AUG UUC UAC AAC AAG UAG#22
GGU GCA GAA GAU GUA CCA AGU CUA ACA CGU UGU UGG CAU CAA AUA AUG UUC UAC AAC AAG UAG#23
GGU GCA GAA GAU GUA CCA AGU CUA ACA CGU UGU UGG CAU CAA AUA AUG UUU UAC AAC AAG UAG#24
GGU GCA GAA GAU GUU CCA AGU CUA ACA CGU UGU UGG CAU CAA AUA AUG UUU UAU AAC AAG UAG#25
GGU GCA GAA GAU GUU CCA AGU CUU ACA CGU UGU UGG CAU CAA AUA AUG UUU UAU AAC AAG UAA#26
GGU GCA GAA GAU GUU CCA AGU CUU ACA CGU UGU UGG CAU CAA AUA AUG UUU UAU AAU AAG UAA#27
GGU GCA GAA GAU GUU CCA AGU UUG ACA CGU UGU UGG CAU CAA AUA AUG UUU UAU AAU AAA UAA#28
GGU GCA GAA GAU GUU CCA AGU UUA ACA CGU UGU UGG CAU CAA AUA AUG UUU UAU AAU AAA UAA#29
GGU GCA GAA GAU GUU CCA AGU UUA ACA CGU UGU UGG CAU CAA AUU AUG UUU UAU AAU AAA UAA#30
GGU GCA GAA GAU GUU CCA AGU UUA ACA CGU UGU UGG CAU CAA AUU AUG UUU UAU AAU AAA UAA#31
GGU GCA GAA GAU GUU CCA AGU UUA ACA CGU UGU UGG CAU CAA AUU AUG UUU UAU AAU AAA UAA#32
1Gly 2Ala 3Glu 4Asp 5Val 6Pro 7Ser 8Leu 9Thr 10Arg 11Cys 12Trp 13His 14Gln 15Ile 16Met 17Phe 18Tyr 19Asn 20Lys stop
Fig S2b1 The recruitment orders of the codon pairs and the amino acids in the roadmap Fig 1A are listedin the left column and top row respectively. The codons that encode each amino acid are listed in columnsrespectively, where the row positions for the degenerate codons are determined by the corresponding codonsin the first column of codon pairs from #1 to #32 (blue codons), and the other vacant positions in eachcolumn are filled by the nearby degenerate codons respectively. These blue codons are aligned from up-leftto down-right, which shows that the later the codon pairs recruited, the later the corresponding amino acidsrecruited.
52
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
0.3 0.4 0.5 0.60
0.2
0.4
0.6
0.8
1
GC content
co
do
n p
ositio
n−
sp
ecific
GC
co
nte
nt
1st codon position
2nd codon position
3rd codon position
Fig S2b2 It is observed that the GC content in the third codon position decreases more rapidly than thatin the first and second base positions when the total GC content decreases, and the GC content in the firstcodon position is generally greater than that in the second codon position. Such a pattern of variation of thebase position specific GC contents in observations can be explained by the roadmap. According to the ordersof degenerate codons for each amino acid in Fig S2b1, the total GC content and the GC contents at 1st-,2nd-, 3rd-codon-position at stages from #1 to #32 can be obtained by counting and calculating in each rowof Fig S2b1, respective. Consequently, the relationships between the total GC content and the base positionspecific GC contents are obtained respectively. Even some detailed features in observation are reproducedin the present figure based on the roadmap Fig 1A, such as the closer distance between 1st and 2nd codonposition GC content at low total GC content side. Such an explanation indicates that the recruitment orderof the codons from #1 to #32 obtained based on the roadmap Fig 1A is considerably reasonable, and henceconfirms the validity of the roadmap theory.
53
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
0.3 0.4 0.5 0.60
0.2
0.4
0.6
0.8
1
GC content
fre
qu
en
cie
s o
f sto
p c
od
on
s
UGA
UAG
UAA
Fig S2b3 It is observed that the stop codons vary in certain pattern when the total GC content decreases basedon genomic data. Namely, along the evolutionary direction of the decreasing total GC content, the first stopcodon UGA in the roadmap decrease greatly, the second codon UAG in the roadmap decrease slightly, and thethird stop codon UAA increase greatly. Such a pattern in observations can be explained by the roadmap. Therelationships between the total GC content and the frequencies of the three stop codons are obtained basedon the roadmap Fig 1A, respectively, by counting and calculating similarly in Fig S2b2, where the variationranges for the three stop codons have been adjusted according to the observations. A detailed feature inobservations, namely the respective downward and upward leaps of UGA and UAA at the middle total GCcontent, is reproduced in the present figure based on the counts in Fig S2b1. Such an explanation indicatesthat the recruitment order of the three stop codons obtained based on the roadmap Fig 1A is considerablyreasonable, and hence confirms the validity of the roadmap theory.
54
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
0
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0.16
variation o
f th
e a
min
o a
cid
fre
quencie
s for
the s
pecie
s in R
10/1
0 e
volu
tionary
ord
er
Gly Ala Glu Asp Val Pro Ser Leu Thr Arg Cys Trp His Gln Ile Met Phe Tyr Asn Lys
Fig S2b4 Explanation of the variation of the amino acid frequencies. The 20 amino acids are arranged inthe recruitment order in the roadmap Fig 1A. The 20 amino acid frequencies for each of the 803 species areobtained respectively based on the genomic data in NCBI. And the 803 amino acid frequencies (green dots)for each of the 20 amino acids are all arranged properly in the R10/10 order, respectively. The variation trendof the amino acid frequencies for each of the 20 amino acids is obtained by the regression line (denoted inred). Generally speaking, the variation trends for the earlier amino acids tend to decrease, and the variationtrends for the latecomers to increase (Fig S2b4).
55
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
−12
−6
0
6
12
slo
pe o
f th
e v
ariation o
f am
ino a
cid
fre
quencie
s (
10
−5)
Gly Ala Glu Asp Val Pro Ser Leu Thr Arg Cys Trp His Gln Ile Met Phe Tyr Asn Lys
Fig S2b5 Increasing variation rates of the amino acid frequencies. The slope of the regression line of variationof amino acid frequencies for each amino acid is obtained based on the data in Fig S2b4. It is observed inthe present figure that the slopes vary from negative to positive in general according to the recruitment orderof amino acids in the roadmap. Such a natural result indicates that the recruitment order of the amino acidsfrom No.1 to No.20 obtained based on the roadmap Fig 1A is considerably reasonable, and hence confirmsthe validity of the roadmap theory.
56
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
G
T
G
C
A
C
#19
Val His
IA/m IIA/M<G> <C>
N_branch
G
T
A
C
A
T
#22
Met His
IA/m IIA/M<A> <T>
N_leaf
G
C
G
C
G
C
#7
Ala Arg
IIA/M IA/m<G> <C>
N_branch
G
C
A
C
G
T
#16
Thr Arg
IIA/M IA/m<G> <C>
N_branch
A
T
G
T
A
C
#24
Val Tyr
IA/m IC/M<A> <T>
N_branch
A
T
A
T
A
T
#29
Ile TyrMet,stp
IA/m IC/M<A> <T>
N_leaf
A
C
G
T
G
C
#9
Ala Cys
IIA/M IA/m<G> <C>
N_leaf
A
C
A
T
G
T
#18
Thr Cys
IIA/M IA/m<A> <T>
N_leaf
G
A
G
C
T
C
#4
Glu Leu Thr
IB/m IA/m<G> <C>
N_branch
G
A
A
C
T
T
#27
Lys Leu Thr
IIB/m IA/m<A> <T>
N_leaf
G
G
G
C
C
C
#1
Gly Pro
IIA/M IIA/M<G> <C>
N_branch
G
G
A
C
C
T
#11
Arg Prostp,Ser
IA/M IIA/M<G> <C>
N_branch
A
A
G
T
T
C
#23
Glu Phe
IB/m IIC/m<A> <T>
N_branch
A
A
A
T
T
T
#32
Lys Phe Asn
IIB/m IIC/m<A> <T>
N_leaf
A
G
G
T
C
C
#3
Gly Ser
IIA/M IIA/M<G> <C>
N_leaf
A
G
A
T
C
T
#14
Arg Serstp,Ser
IA/M IIA/M<A> <T>
N_leaf
G
G
T
C
C
A
#12
Trp Pro
IC/m IIA/M<G> <C>
N_branch
G
A
T
C
T
A
#25
stp LeuGln,Pyl Thr
IA/m<A> <T>
N_branch
G
G
C
C
C
G
#10
Arg Pro stp
IA/m IIA/M<G> <C>
N_branch
G
A
C
C
T
G
#20
Gln Leu Thr,Ser
IB/m IA/m<G> <C>
N_branch
A
G
T
T
C
A
#15
stp SerTrp,Cys
IIA/M<A> <T>
N_leaf
A
A
T
T
T
A
#31
stp Leu Gln
IA/m<A> <T>
N_leaf
A
G
C
T
C
G
#13
Arg Ser
IA/m IIA/M<G> <C>
N_leaf
A
A
C
T
T
G
#28
Gln Leu
IB/m IA/m<A> <T>
N_leaf
T
G
G
A
C
C
#6
Gly Thr
IIA/M IIA/M<G> <C>
N_branch
T
A
G
A
T
C
#21
Asp Ile
IIB/M IA/m<A> <T>
N_branch
C
G
G
G
C
C
#2
Gly Ala
IIA/M IIA/M<G> <C>
N_branch
C
A
G
G
T
C
#5
Asp Val
IIB/M IA/m<G> <C>
N_branch
T
G
A
A
C
T
#17
Ser Thr
IIA/M IIA/M<A> <T>
N_leaf
T
A
A
A
T
T
#30
Asn Ile
IIB/M IA/m<A> <T>
N_leaf
C
G
A
G
C
T
#8
Ser Ala
IIA/M IIA/M<G> <C>
N_leaf
C
A
A
G
T
T
#26
Asn Val
IIB/M IA/m<A> <T>
N_leaf
#19−Val−#24
#22−Met−#29
#7−Ala−#9
#16−Thr−#18
#19−His−#22
#7−Arg
−#16
#24−Tyr−#29
#9−Cys
−#18
#4−Glu−#23
#27−Lys−#32
#1−Gly−#3*
#11−Arg−#14
#4−Leu−#27
#1−Pro
−#11
#23−Phe−#32
#3−Ser−
#14
#1
0−
Pro
−#
12
#2
0−
Le
u−
#2
5
#1
3−
Se
r−#
15
#2
8−
Le
u−
#3
1
#12−Trp−#15
#25−stp−#31
#10−Arg−#13
#20−Gln−#28
#2
−G
ly−
#6
*
#5
−A
sp
−#
21
#8
−S
er−
#1
7
#2
6−
Asn
−#
30
#6−Thr−#17
#21−Ile−#30
#2−Ala−#8
#5−Val−#26
Route 2
Route 0
Route 3
Route 1
biosynthetic
families
Glu
Asp
Val
Ser
Phe
initial/stop
aaRS types
IIA,IIB,IIC
IA,IB,IC
M: major groove
m: minor groove
genomic codon
distribution types
<G>,<C>,<A>,<T>
pair connections
#1−Gly−#3 etc.
Colored ones indicate
route duality
node types
N_branch: branch node
N_leaf: leaf node
facets
w es
n
d
u
Fig 2C Explanation of the codon degeneracy based on the relationships among codons (pair connectionsand route dualities) in the roadmap. This is a revised plot of the roadmap Fig 1A to indicate the symmetryin the evolution of the genetic code, where the four routes are represented by four cubes respectively. Pairconnections are marked besides the evolutionary arrows in the roadmap. Route dualities are indicated bysame colours for the corresponding pair connections. The biosynthetic families of amino acids are denotedby coloured semicircles. The types of the aminoacyl-tRNA synthetases are besides the codons. Branch nodesand leaf nodes are distinguished. The 6 facets for each route are indicated on a cube at bottom right.
57
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
G
G
G
A
C
U
C
C
G
A
C
U
A
A
G
A
C
U
U
U
G
A
C
U
GGG Route 0 R
GGA Route 0 R
GGC Route 1 R
GGU Route 1 R
GC G Route 2 R
GC A Route 2 R
GC C Route 1 Y
GC U Route 1 Y
GA G Route 0 R
GA A Route 0 R
GA C Route 1 R
GA U Route 1 R
GU G Route 2 R
GU A Route 2 R
GU C Route 1 Y
GU U Route 1 Y
C GG Route 3 R
C GA Route 3 R
C GC Route 2 Y
C GU Route 2 Y
C C G Route 3 Y
C C A Route 3 Y
C C C Route 0 Y
C C U Route 0 Y
C A G Route 3 R
C A A Route 3 R
C A C Route 2 Y
C A U Route 2 Y
C U G Route 3 Y
C U A Route 3 Y
C U C Route 0 Y
C U U Route 0 Y
A GG Route 0 R
A GA Route 0 R
A GC Route 1 R
A GU Route 1 R
A C G Route 2 R
A C A Route 2 R
A C C Route 1 Y
A C U Route 1 Y
A A G Route 0 R
A A A Route 0 R
A A C Route 1 R
A A U Route 1 R
A U G Route 2 R
A U A Route 2 R
A U C Route 1 Y
A U U Route 1 Y
U GG Route 3 R
U GA Route 3 R
U GC Route 2 Y
U GU Route 2 Y
U C G Route 3 Y
U C A Route 3 Y
U C C Route 0 Y
U C U Route 0 Y
U A G Route 3 R
U A A Route 3 R
U A C Route 2 Y
U A U Route 2 Y
U U G Route 3 Y
U U A Route 3 Y
U U C Route 0 Y
U U U Route 0 Y
Fig S2c1 The distribution of codons from R- and Y-strands of Route 0 − 3 in the GCAU genetic code table.The pattern of the 4×4 codon boxes for the degenerate codons relates to such a distribution of the four routes,due to the evolution of the genetic code along the roadmap.
58
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
G
G
G
A
C
U
C
C
G
A
C
U
A
A
G
A
C
U
U
U
G
A
C
U
GGG#1 No. 1Gly
GGA#3 No. 1Gly
GGC#2 No. 1Gly
GGU#6 No. 1Gly
GCG#7 No. 2Ala
GCA#9 No. 2Ala
GCC#2 No. 2Ala
GCU#8 No. 2Ala
GAG#4 No. 3Glu
GA A#23 No. 3Glu
GAC#5 No. 4Asp
GA U#21 No. 4Asp
GUG#19 No. 5Val
GUA#24 No. 5Val
GUC#5 No. 5Val
GUU#26 No. 5Val
CGG#10 No.10Arg
CGA#13 No.10Arg
CGC#7 No.10Arg
CGU#16 No.10Arg
C CG#10 No. 6Pro
C CA#12 No. 6Pro
C CC#1 No. 6Pro
C CU#11 No. 6Pro
CAG#20 No.14Gln
CA A#28 No.14Gln
CAC#19 No.13His
CA U#22 No.13His
C UG#20 No. 8Leu
C UA#25 No. 8Leu
C UC#4 No. 8Leu
C UU#27 No. 8Leu
AGG#11 No.10Arg
A GA#14 No.10Arg
A GC#8 No. 7Ser
A GU#17 No. 7Ser
A CG#16 No. 9Thr
A CA#18 No. 9Thr
A CC#6 No. 9Thr
A CU#17 No. 9Thr
AAG#27 No.20Lys
A A A#32 No.20Lys
A A C#26 No.19Asn
AAU#30 No.19Asn
A UG#22 No.16Met
A UA#29 No.15Ile
A UC#21 No.15Ile
A UU#30 No.15Ile
UGG#12 No.12Trp
UGA#15 stop
UGC#9 No.11Cys
UGU#18 No.11Cys
U CG#13 No. 7Ser
U CA#15 No. 7Ser
U CC#3 No. 7Ser
U CU#14 No. 7Ser
UA G#25 stop
UA A#31 stop
UAC#24 No.18Tyr
UA U#29 No.18Tyr
U UG#28 No. 8Leu
U UA#31 No. 8Leu
U UC#23 No.17Phe
U UU#32 No.17Phe
biosynthetic families: Glu Asp Val Ser Phe others
Fig S2c2 The clusterings of biosynthetic families (Glu Asp Val Ser Phe) in the GCAU genetic code table.Such nice clusterings are correspondingly observed in the R- and Y-strands of Route 0− 3 in Fig 2C (denotedin the same group of colour as in the present figure). The clusterings of biosynthetic families in the presentfigure are closely related to the distribution of codons from R- and Y-strands of Route 0 − 3 in Fig S2c1,due to the recruitment of amino acids along the roadmap. Generally speaking, the amino acids are arrangedproperly in the recruitment order from No.1 to No.20 along the direction from G, C to A, U in the GCAUgenetic code table.
59
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
G
G
G
A
C
U
C
C
G
A
C
U
A
A
G
A
C
U
U
U
G
A
C
U
GGG Gly IIA m
GGA Gly IIA m
GGC Gly IIA m
GGU Gly IIA m
GCG Ala IIA M
GCA Ala IIA M
GCC Ala IIA M
GCU Ala IIA M
GAG Glu IB m
GAA Glu IB m
GAC Asp IIB M
GAU Asp IIB M
GUG Val IA m
GUA Val IA m
GUC Val IA m
GUU Val IA m
CGG Arg IA m
CGA Arg IA m
CGC Arg IA m
CGU Arg IA m
CCG Pro IIA M
CCA Pro IIA M
CCC Pro IIA M
CCU Pro IIA M
CAG Gln IB m
CAA Gln IB m
CAC His IIA M
CAU His IIA M
CUG Leu IA m
CUA Leu IA m
CUC Leu IA m
CUU Leu IA m
AGG Arg IA m
AGA Arg IA m
AGC Ser IIA m
AGU Ser IIA m
ACG Thr IIA M
ACA Thr IIA M
ACC Thr IIA M
ACU Thr IIA M
AAG Lys IIB m
AAA Lys IIB m
AAC Asn IIB M
AAU Asn IIB M
AUG Met IA M
AUA Ile IA M
AUC Ile IA M
AUU Ile IA M
UGG Trp IC m
UGA stp m
UGC Cys IA m
UGU Cys IA m
UCG Ser IIA M
UCA Ser IIA M
UCC Ser IIA M
UCU Ser IIA M
UAG stp m
UAA stp m
UAC Tyr IC M
UAU Tyr IC M
UUG Leu IA M
UUA Leu IA M
UUC Phe IIC M
UUU Phe IIC M
Fig S2c3 The distribution of types of aminoacyl-tRNA synthetases in the GCAU genetic code table. Theaminoacyl-tRNA synthetases can be divided into Class II and Class I, which can be divided into subclassesIIA, IIB, IIC, and IA, IB, IC, respectively. And the aminoacyl-tRNA synthetases can also be divided intominor groove ones (m) and major groove ones (M).
60
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
#11 #14
#27 #32
#23
#26
#8 #21 #30
#17
#16 #18
#7 #9 #22 #29
#19 #24
#20 #28
#10 #13 #25 #31
#12 #15
CCC
3’
5’
GGG
5’
3’
GGG
3’
5’
CCC
5’
3’
CCC
3’
5’
GGG
5’
3’
AGG
3’
5’
TCC
5’
3’
CCC
3’
5’
GGG
5’
3’
GAG
3’
5’
CTC
5’
3’
CCC
3’
5’
GGG
5’
3’
CGG
3’
5’
GCC
5’
3’
CCG
3’
5’
GGC
5’
3’
CCG
5’
3’
GCC
3’
5’
CGG
5’
3’
CAG
3’
5’
GTC
5’
3’
CCG
3’
5’
GGC
5’
3’
CCG
5’
3’
GCC
3’
5’
CGG
5’
3’
TGG
3’
5’
ACC
5’
3’
#1
#3
#4
#2
#5
#6
11#14#
27#32#
23#
26#
8#21#30#
17#
16#18#
7# 9#22#29#
19#24#
20#28#
10#13#25#31#
12#15#
CCC
3’
5’
GGG
5’
3’
GGG
3’
5’
CCC
5’
3’
CCC
3’
5’
GGG
5’
3’
AGG
3’
5’
TCC
5’
3’
CCC
3’
5’
GGG
5’
3’
GAG
3’
5’
CTC
5’
3’
CCC
3’
5’
GGG
5’
3’
CGG
3’
5’
GCC
5’
3’
CCG
3’
5’
GGC
5’
3’
CCG
5’
3’
GCC
3’
5’
CGG
5’
3’
CAG
3’
5’
GTC
5’
3’
CCG
3’
5’
GGC
5’
3’
CCG
5’
3’
GCC
3’
5’
CGG
5’
3’
TGG
3’
5’
ACC
5’
3’
1#
3#
4#
2#
5#
6#
Left−handed roadmap
left−handed triplex nucleic acids
D−amino acid, L−ribose
pyrimidine, purine
Right−handed roadmap
right−handed triplex nucleic acids
D−ribose, L−amino acid
purine, pyrimidine
→ ←
non−chirality
point
Fig 2D The homochirality of life originated in a winner-take-all game between the right-handed roadmap andthe left-handed roadmap. The right handed roadmap is just the roadmap in Fig 1A (only the initial subset areshown explicitly here), whose chirally symmetric partner roadmap, i. e., the left-handed roadmap, is plottedon the left. Ref to Fig S2d1 to see a heuristic model on the origin of the homochirality of life.
61
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
#11#
#22#
#33#
#44#
#55#
#66#
#77#
#88#
#99#
#1010#
#1111#
#1212#
#1313#
#1414#
#1515#
#1616#
#1717#
#1818#
#1919#
#2020#
#2121#
#2222#
#2323#
#2424#
#2525#
#2626#
#2727#
#2828#
#2929#
#3030#
#3131#
#3232#
Right−handed roadmapLeft−handed roadmap
fulfillment of the right−handed roadmap
uncompleted left−handed roadmap
(with competence)
uncompleted left−handed roadmap
(without competence)
↔competing for bases →
subs
titu
tion
←substitution
Fig S2d1 Simulation of the origin of homochirality of life. The roadmap in Fig 1A is a right-handed roadmap,whose chirally symmetric counterpart, i. e. the left-handed roadmap, can also be achieved in theory (Fig2D). In this simulation, a constant substitution rate is assumed for the substitutions from #1 to #32 for theright-handed roadmap and from 1# to 32# for the left-handed roadmap. Which chirality to be selected isabsolutely by chance. The programme can be run for many times. The two chiral roadmaps are obtainedequally. When any chiral roadmap is fulfilled (such as the right-handed roadmap in the present figure),the chirally opposite roadmap may reach a certain uncompleted stage. The simulation indicates that thecompetition for the non-chiral bases between the two chiral roadmaps may enlarge the gap between thefulfilled chiral roadmap and the uncompleted chiral roadmap.
62
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
AA
A-H
4
T
TT
-H4
AA
T-H
4 A
TT
-H4
A
TA
-H4
TA
T-H
4
TA
A-H
4
TTA-H
4
A
AC
-H3
GTT-H
3
CA
A-H
3
TTG
-H3
A
CA
-H3
TGT-H3
TCA
-H3
TGA-H3
ACT-H3
AGT-H3
GTA-H3 TAC-H
3
CTA-H3TA
G-H
3
AGA-H3
TC
T-H
3CTC
-H2
GA
G-H
2
AA
G-H
3
C
TT
-H3G
AA
-H3T
TC
-H3
AT
C-H
3
GA
T-H
3
AT
G-H
3
CA
T-H
3A
CC
-H2
GG
T-H
2
CC
A-H
2
TG
G-H
2
CA
C-H
2GT
G-H
2
A
GC
-H2
GC
T-H
2
GC
A-H
2TGC-H2
CAG-H2
CTG-H2
ACG-H2
CGT-H2
CGA-H2
TCG-H2
GAC-H2
GT
C-H
2
CCG-H1
CG
G-H
1
GCC-H
1
GG
C-H
1
CG
C-H
1G
CG
-H1
CC
C-H
1
GG
G-H
1
AG
G-H
2 CC
T-H
2 GG
A-H
2TCC-H
2
Fig 3A The tree of codons based on the genomic codon distributions, which agrees with the differentiationof the four hierarchies Hierarchy 1 − 4 (H1-4) in the roadmap. This tree of codons is plotted based on thedistance matrix for codons by averaging the correlation coefficients of genomic codon distributions amongthe species. Both the tree of codons in this figure and the tree of species in Fig S3a5 are obtained based on asame set of genomic codon distributions.
63
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
G G G G G G G G G G G G G G G G G G G G G G G G G G G
# 1 # 1 # 1 # 1 # 1 # 1 # 1 # 1 # 1
A G G G G C G G G G G A G C G G G G G G G C G G G A G
# 11 # 2 # 1 # 3 # 7 # 1 # 1 # 10 # 4
A G A G A C G G G G G A G T G G G G G G C T G G G A A
# 14 # 5 # 1 # 3 # 19 # 1 # 2 # 12 # 23
A G A A A C G A G G G A G T A G G G A G C T A G A A A
# 14 # 26 # 4 # 3 # 24 # 1 # 8 # 25 # 32
A G A A A C A A G G G A A T A A G G A G C T A A A A A
# 14 # 18 # 23 # 31 # 8 # 31 # 32
# 16 # 4 # 25
FROM the initial subset (GC content for the initial subset: GC%=13/(3*6)=72%)
TO the leaf nodes (GC content for the leaf nodes: GC%=16/(3*16)=33%)
Fig S3a1 The base substitutions along the roadmap may occur randomly at any places in a strand accordingto the roadmap Fig 1A. For a simple example, starting from Poly G in the present figure, the strand mayevolve randomly step by step according to the base substitutions in the triplex DNA along the roadmap. Afterseveral steps, a certain sequence of strand is achieved. Such base substitutions essentially determine thegenomic codon distributions for species (Fig S3a3). The GC content for the initial subset (Fig 1A) is about72%, and the GC content for the leaf nodes about 33%. This range of GC content generally agrees with therange of GC content in observations (Fig S3a2).
64
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
0 10 20 30 40 50 60 70 80 90 10010
0
101
102
103
GC content (percentage)
GC
co
nte
nt
dis
trib
utio
ns f
or
do
ma
ins
Bacteria
Euryarchaeota
Crenarchaeota
Eukarya
Fig S3a2 GC content ranges for taxa, which can be explained by the evolution of sequences based on theroadmap (Fig S3a1). The origin order of domains (obtained in Fig 3C) is irrelevant to the order of the averageGC contents in taxa here.
65
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
An example for calculating genomic codon distributions: >gi|71658851|ref|NC_007220.1| Duck circovirus, complete genome1 2 3 4 5 6 7 8 9 1011 1213 141516 171819 202122 2324 252627 282930 3132 333435 363738 394041 4243 444546 474849 5051 525354 555657 585960 6162 636465 666768 6970 717273 747576 777879 8081 828384 858687 8889 909192 939495 969798 99100
1
101
201
301
401
501
601
701
801
901
1001
1101
1201
1301
1401
1501
1601
1701
1801
1901
A CCGGCGCT T GT A CT CCGT A CT CCGGGCCA A GGGGA T A GCA A CT GCA A T GGCGA A GA GCGGCA A CT A CT CA T A CA A GA GGT GGGT CT T T A CCA T T A A T A ACCCT A CCT T T GA A GA CT A T GT T CA T GT T CT A GA GT T T T GCA CGCT CGA CA A T T GCA A GT T T GCT A T CGT CGGCGA GGA GA A GGGCGCGA A CGGCA CGCCTCA CCT T CA GGGA T T CCT GA A CCT CCGA A GT A A T GCGCGA GCT GCCGCCCT T GA GGA GT CGCT GGGA GGA A GA GCCT GGCT CT CT CGT GCCCGGGGA T CT GA CGA A GA CA A T GA A GA A T A T T GCGCCA A A GA GT CGA CA T A CCT T CGA GT T GGT GA A CCA GT CT CCA A GGGT CGA T CCT CT GA T CT GGCCGA A GCGA CA T CCGCT GT GA T GGCT GGT GT GCCGCT A A CT GA GGT GGCCCGGA A GT T CCCCA CGA CT T A T GT T A T CT T T GGGCGT GGCCT GGA A CGCCT CCGT CA CCT GA T CGT CGA GA CGCA A CGT GA T T GGA A GA CCGA A GT CA T CGT T CT GA T T GGT CCGCCCT GCA CCGGA A A GA GCCGT T A T GCA T T T GA A T T T CCCGCCGA A A A CAA GT A T T A CA A A CCA CGCGGGA A GT GGT GGGA CGGT T A CT CGGGA A A T GA CGT A GT CGT CA T GGA CGA CT T T T A T GGT T GGT T GCCGT A T GA T GA T T T GCTGA GA A T T A CCGA CCGT T A CCCGT T GCGGGT T GA A T T T A A A GGCGGT A T GA CT CA GT T T GT GGCT A A A A CA T T GA T CA T A A CA A GT A A CCGCGA GCCT CGTGA T T GGT A CA A GA GCGA GT T T GA CCA T T CA GCT CT GT A CCGCA GA A T T A A CA A GT A T CT T GT GT A T A A T A T CGA CA A GT A T GA A CCCGCCCA GGCA T GT ACCCT CCCGT T CCCA A T A A A CT A CT GA GA CGA A A T T GCCGT T A GT CGT T T A T T A A GGT A A A CA CT T GGCA GGGT A T T A CA CT T GGGCA GCT CGGT T A A GGTT A GGCA CT T GGCA GGGT A T T A CA CT T GGGCA GCT CA GT T A A GGT T A GGCA CT T GGCA GGGT A T T A CA CT T GGGCA GCT CGGT T A A GGT T A GGCA CT T GGCA GGGT A T T A CA CT T GGGCA GCT CGGT T A A GGT T A GA A CA A GA T CT CGT T T A T GCCT A GA A T CCA GT GA A CT GT CCA A A T T T GA T A T A A A A T GT GA A CT GGGCCT CT A T GT CGT A T T GT GCA T T CA CA CCT GT T GT GT T GT CCGGCT T CCGT A GA CT CA T GCCCA T GCCGT A A T GCA CT A CT T GA T CT T GA GT GCA GT T GAT CCA GGT GA A A T T T T T GCCT CCGA GGA A GA A GGT A GA GT GCT T CGT A CCT T CA CCCGCT CCT T GT A T GA T GGGT T T A GGT A T GA A GT A CCGT CT GT GCA CT CT GGA GGGGT CCCA CGT T CT CCT GCT GCT GGT T GA GCCGT A CGGGT CT A T GGA CCA T CCCGT CGT GGA T GT CT T CA CT A T T T GCCCGT CCT T GT CT A T CA CCGT T GA GCCT T GA GT CT T GCT CT T CT GGT A A A T GT T GT A T GCCGGT CT GA GA GT T A T GGCT A CCCCCT T GA T CA T GT A GT A A T CGT A A GGGT A GT GGTA GA A CGGT GT T GA GCCGGT CA T GT A GCT GT T CGT GT CCCCGA A CA T CGCCCA T CT CA T GT T T A A GCCGCA T A T A T T GT T GCCA CGGGCCGGT GGGT CT GTGT A T T GT GCGCCA T CT T CT A A GCT T A A GCT T T GCCA T T T GCCT GCT GCA GT CGGT CCT GT GGT GGA GCCGA A A A A GCCA A A T A CGGT GT T CCT CGT GA CCT T A T A A GT CA CT A CGGA A A A CCGT CGT CGGGGT CT T GCGA T T CGT A GCCT T CGT CT T CT GA A T CT CCT A CGT A GA CCCCGCCT CT T CCT CCGA CCA CGA TA T GCA CGCCGA T A GGT GCGGCCGCGCA T CGGCA A CGT GGGCA GT GA CT T CCT GT A A GA T A T A A A A GA CCGCA CA A GCGCCGT T A T A T T A T T
The codon positions in the above genome for GCA (green) and TAT (red) are as follows respectively:
39 45 61 139 154 193 509 556 576 841 894 967 985 1004 1011 1029 1048 1055 1073 1092 1099 1117 1219 1274 1293 1397 1668 1747 1903 1925 1931 1940 1970
117 164 318 456 461 573 603 672 687 746 855 864 869 879 949 973 1017 1061 1105 1150 1184 1206 1213 1365 1380 1449 1479 1497 1540 1557 1671 1673 1702 1802 1900 1959 1983 1985 1988
Hence, the genomic codon intervals for GCA (green) and TAT (red) are as follows respectively:
6 16 78 15 39 316 47 20 265 53 73 18 19 7 18 19 7 18 19 7 18 102 55 19 104 271 79 156 22 6 9 30
47 154 138 5 112 30 69 15 59 109 9 5 10 70 24 44 44 44 45 34 22 7 152 15 69 30 18 43 17 114 2 29 100 98 59 24 2 3
Thus, the genomic codon distributions for GCA (green) and TAT (red) are as follows respectively (cutoff=90):
0 0 0 0 0 2 3 0 1 0 0 0 0 0 1 1 0 4 4 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0[ ]
0 2 1 0 2 0 1 0 1 1 0 0 0 0 2 0 1 1 0 0 0 1 0 2 0 0 0 0 1 2 0 0 0 1 0 0 0 0 0 0 0 0 1 3 1 0 1 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0[ ]
Similarly, the genomic codon distributions for all the 64 codons can be obtained for any complete genome.
The composition vector (K=3) corresponding to the above genome:
34 34 21 46 21 43 33 27 31 25 39 24 36 35 40 44 29 28 29 40 49 27 22 35 20 28 23 25 35 34 18 39 30 24 27 31 23 32 21 30 42 25 25 23 34 24 20 33 42 38 42 38 33 31 20 34 19 27 28 39 55 25 35 25[ ]
GGG
GGC
GGA
GGT
GCG
GCC
GCA
GCT
GAG
GAC
GAA
GAT
GTG
GTC
GTA
GTT
CGG
CGC
CGA
CGT
CCG
CCC
CCA
CCT
CAG
CAC
CAA
CAT
CTG
CTC
CTA
CTT
AGG
AGC
AGA
AGT
ACG
ACC
ACA
ACT
AAG
AAC
AAA
AAT
ATG
ATC
ATA
ATT
TGG
TGC
TGA
TGT
TCG
TCC
TCA
TCT
TAG
TAC
TAA
TAT
TTG
TTC
TTA
TTT
0 10 20 30 40 50 60 70 80 900
1
2
3
4
5
Number of GCA: 33
Height:0.28889
The genomic codon distribution for GCA
0 10 20 30 40 50 60 70 80 900
1
2
3
4
5
Number of TAT: 39
Height0.33333
The genomic codon distribution for TAT
Fig S3a3 Definition of the genomic codon distribution. Take the complete genome of Duck circovirusfor example, and take the codons GCA (denoted in green) and T AT (denoted in red) among the 64 codonsfor example. First, mark all the positions of GCA and T AT in the genome respectively; second, calculatethe codon intervals for GCA and T AT respectively, and obtain the distributions of intervals of GCA andT AT respectively, where a cutoff 90 is used in this simple case when counting the number of commonintervals (a sufficiently large enough cutoff 1000 is used in this paper); and third, obtain the genomic codondistributions for GCA and T AT by normalising the distributions of intervals of GCA and T AT respectively.The genomic codon distribution for a species consists of 64 distributions for each of the 64 codons. Anexample of unnormalised genomic codon distribution (take for example only two distributions for GCA andT AT ) is illustrated in two subfigures of the present figure. Four normalised genomic codon distributionsfor species in different domains are respectively shown in Fig S3b2. In addition, the summations of theelements of the unnormalised genomic codon distributions (e. g. 33 and 39 for GCA and T AT respectively)are obtained; consequently the average height of the genomic codon distribution is obtained for each of thedistributions of the 64 codons respectively. In a so-called composition vector tree method in the molecularphylogenetics, the elements of the K = 3 composition vector for a species (e. g. the elements 33 and39 for GCA and T AT ) correspond to the summations of the elements of the genomic cordon distributionsrespectively, which are also proportional to the average heights of the genomic codon distribution.
66
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
1 2 3 4 5 6 7 8 9 101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
1
101
201
301
401
501
601
701
801
901
1001
1101
1201
1301
1401
1501
1601
1701
1801
1901
2001
2101
2201
2301
2401
2501
2601
2701
2801
2901
3001
3101
3201
3301
3401
3501
3601
3701
3801
3901
ACCGGCGCT T GT A CT CCGT ACT CCGGGCCAA GGGGA T A GCA A CT GCA A T GGCGA A GA GCGGCA A CT A CT CA T A CA A GA GGT GGGT CT T T ACCA T T A A T AACCCT A CCT T T GA A GA CT A T GT T CA T GT T CT A GA GT T T T GCA CGCT CGA CAA T T GCA A GT T T GCT A T CGT CGGCGA GGA GAA GGGCGCGA ACGGCA CGCCTCA CCT T CA GGGA T T CCT GA ACCT CCGA A GT A A T GCGCGA GCT GCCGCCCT T GA GGA GT CGCT GGGA GGA AGA GCCT GGCT CT CT CGT GCCCGGGGA T CT GACGA A GA CA AT GA A GA A T A T T GCGCCA A A GA GT CGA CA T ACCT T CGA GT T GGT GA A CCA GT CT CCA A GGGT CGA T CCT CT GA T CT GGCCGA A GCGA CA T CCGCT GT GA T GGCT GGT GT GCCGCT A A CT GAGGT GGCCCGGA A GT T CCCCACGA CT T A T GT T A T CT T T GGGCGT GGCCT GGA A CGCCT CCGT CA CCT GA T CGT CGA GA CGCA A CGT GA T T GGA A GA CCGA AGT CA T CGT T CT GA T T GGT CCGCCCT GCA CCGGA A A GA GCCGT T A T GCA T T T GA A T T T CCCGCCGA A A A CAAGT A T T A CA AA CCA CGCGGGA A GT GGT GGGA CGGT T A CT CGGGA A A T GA CGT A GT CGT CAT GGA CGA CT T T T A T GGT T GGT T GCCGT A T GA T GA T T T GCTGA GA A T T A CCGA CCGT T A CCCGT T GCGGGT T GA A T T T A A AGGCGGT A T GACT CA GT T T GT GGCT A A A A CAT T GA T CA T A ACA A GT A A CCGCGA GCCT CGTGA T T GGT A CAA GA GCGA GT T T GA CCA T T CAGCT CT GT A CCGCA GA A T T A ACA A GT A T CT T GT GT A T A A T AT CGA CA A GT AT GA A CCCGCCCA GGCA T GT ACCCT CCCGT T CCCA A T A A A CT A CT GA GA CGA A A T T GCCGT T A GT CGT T T AT T A A GGT A A ACA CT T GGCA GGGT A T T A CA CT T GGGCA GCT CGGT T A A GGTT A GGCA CT T GGCA GGGT A T T A CA CT T GGGCA GCT CA GT T AA GGT T A GGCACT T GGCA GGGT A T T A CA CT T GGGCA GCT CGGT T A A GGT T AGGCA CT T GGCAGGGT A T T A CA CT T GGGCA GCT CGGT T A A GGT T A GA A CA AGA T CT CGT T T A T GCCT A GA AT CCA GT GA A CT GT CCA A A T T T GA T A T A A A AT GT GA A CT GGGCCT CT A T GT CGT A T T GT GCA T T CA CA CCT GT T GT GT T GT CCGGCT T CCGT A GA CT CA T GCCCA T GCCGT A A T GCA CT A CT T GA T CT T GAGT GCA GT T GAT CCA GGT GA AA T T T T T GCCT CCGA GGA A GAA GGT A GA GT GCT T CGT A CCT T CA CCCGCT CCT T GT A T GA T GGGT T T A GGT A T GA A GT A CCGT CT GT GCACT CT GGA GGGGT CCCA CGT T CT CCT GCT GCT GGT T GA GCCGT A CGGGT CT AT GGA CCA T CCCGT CGT GGA T GT CT T CA CT AT T T GCCCGT CCT T GT CT A T CACCGT T GA GCCT T GA GT CT T GCT CT T CT GGT A A A T GT T GT A T GCCGGT CT GA GA GT T A T GGCT A CCCCCT T GA T CA T GT AGT A A T CGT A AGGGT A GT GGTAGA A CGGT GT T GA GCCGGT CA T GT A GCT GT T CGT GT CCCCGA A CA T CGCCCA T CT CA T GT T T A A GCCGCAT A T A T T GT T GCCA CGGGCCGGT GGGT CT GTGT A T T GT GCGCCA T CT T CT AA GCT T A A GCT T T GCCA T T T GCCT GCT GCA GT CGGT CCT GT GGT GGA GCCGA A A A A GCCA AA T A CGGT GT T CCT CGT GA CCT T A T A A GT CACT A CGGA A A ACCGT CGT CGGGGT CT T GCGAT T CGT A GCCT T CGT CT T CT GA A T CT CCT A CGT A GA CCCCGCCT CT T CCT CCGA CCA CGATAT GCA CGCCGA T A GGT GCGGCCGCGCA T CGGCA A CGT GGGCA GT GA CT T CCT GT A A GA T AT A A A A GA CCGCA CA A GCGCCGT T A T A T T A T T A CCGGCGCTT GT A CT CCGT A CT CCGGGCCA A GGGGA T A GCA A CT GCA A T GGCGA A GA GCGGCA A CT A CT CA T A CA A GA GGT GGGT CT T T A CCA T T A A T AA CCCT A CCT TT GA A GA CT A T GT T CA T GT T CT A GA GT T T T GCA CGCT CGA CA A T T GCA A GT T T GCT A T CGT CGGCGA GGA GA A GGGCGCGAA CGGCA CGCCT CA CCT T CAGGGA T T CCT GAA CCT CCGA A GT A A T GCGCGAGCT GCCGCCCT T GA GGA GT CGCT GGGA GGAA GA GCCT GGCT CT CT CGT GCCCGGGGA T CT GA CGA A GA CAAT GA A GA A T AT T GCGCCA A AGA GT CGA CA T A CCT T CGA GT T GGT GA A CCAGT CT CCA A GGGT CGA T CCT CT GA T CT GGCCGA A GCGA CA T CCGCT GT GATGGCT GGT GT GCCGCT A A CT GA GGT GGCCCGGA A GT T CCCCA CGA CT T A T GT T A T CT T T GGGCGT GGCCT GGA A CGCCT CCGT CA CCT GA T CGT CGA GA CGCA A CGT GA T T GGA A GA CCGAA GT CA T CGT T CT GA T T GGT CCGCCCT GCA CCGGA A A GA GCCGT T A T GCA T T T GA A T T T CCCGCCGA A A A CA A GT A T T A CAAA CCA CGCGGGA A GT GGT GGGA CGGT T A CT CGGGA A A T GACGT A GT CGT CA T GGA CGA CT T T T A T GGT T GGT T GCCGT A T GA T GA T T T GCT GA GA A T T ACCGA CCGT T A CCCGT T GCGGGT T GA A T T T A AA GGCGGT A T GA CT CA GT T T GT GGCT A A A A CA T T GA T CA T AA CA A GT A A CCGCGA GCCT CGT GA T T GGT ACAA GA GCGA GT T T GA CCA T T CA GCT CT GT A CCGCA GA A T T AA CA A GT A T CT T GT GT A T A A T A T CGA CA A GT A T GA A CCCGCCCA GGCA T GT A CCCT CCCGTT CCCA A T A A ACT A CT GA GA CGA A A T T GCCGT T A GT CGT T T A T T A A GGT A AA CA CT T GGCAGGGT A T T A CACT T GGGCA GCT CGGT T A A GGT T A GGCA CT TGGCA GGGT A T T A CA CT T GGGCA GCT CA GT T A A GGT T A GGCA CT T GGCA GGGT A T T A CA CT T GGGCA GCT CGGT T A A GGT T A GGCA CT T GGCA GGGT A T T ACA CT T GGGCAGCT CGGT T A AGGT T A GA A CAA GA T CT CGT T T A T GCCT A GAA T CCA GT GA ACT GT CCA A A T T T GA T A T A A AA T GT GA A CT GGGCCT CT A T GT CGT A T T GT GCA T T CA CA CCT GT T GT GT T GT CCGGCT T CCGT A GA CT CA T GCCCA T GCCGT A A T GCA CT ACT T GA T CT T GA GT GCA GT T GA T CCA GGT GAAA T T T T T GCCT CCGA GGA A GA A GGT A GA GT GCT T CGT A CCT T CA CCCGCT CCT T GT A T GAT GGGT T T A GGT A T GA A GT A CCGT CT GT GCACT CT GGA GGGGT CCCA CGT T CT CCT GCT GCT GGT T GA GCCGT A CGGGT CT A T GGA CCA T CCCGT CGT GGAT GT CT T CA CT A T T T GCCCGT CCT T GT CT A T CA CCGT T GAGCCT T GA GT CT T GCT CT T CT GGT A A A T GT T GT A T GCCGGT CT GA GA GT T A T GGCT A CCCCCT T GA T CA T GT A GT A A T CGT AA GGGT A GT GGT A GA A CGGT GT T GA GCCGGT CA T GT A GCT GT T CGT GT CCCCGA A CA T CGCCCA T CT CA T GT T T A A GCCGCA T A T A T T GT T GCCA CGGGCCGGT GGGT CT GT GT A T T GT GCGCCA T CT T CT A A GCT T A A GCT T T GCCA T T T GCCT GCT GCAGT CGGT CCT GT GGT GGA GCCGA A A A A GCCAA A T A CGGT GT T CCT CGT GA CCT T A T A A GT CACT A CGGA A AA CCGT CGT CGGGGT CT T GCGA T T CGT A GCCT T CGT CT T CT GA A T CT CCT ACGT A GA CCCCGCCT CT T CCT CCGA CCA CGAT A T GCA CGCCGA T A GGT GCGGCCGCGCA T CGGCA A CGT GGGCA GT GA CT T CCT GT A A GA T A T A A A A GA CCGCA CA A GCGCCGT T A T A T T AT T
Genomic codon distribution for GCA (First half: original genome)0 0 0 0 0 2 3 0 1 0 0 0 0 0 1 1 0 4 4 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
Genomic codon distribution for GCA (Second half: its duplication)0 0 0 0 0 2 3 0 1 0 0 0 0 0 1 1 0 4 4 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
Genomic codon distribution for GCA (the duplicated genome)0 0 0 0 0 4 6 0 2 0 0 0 0 0 2 2 0 8 8 2 0 2 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 2 0 0 0 0 0 2 0 2 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0
0 20 40 60 800
1
2
3
4
Genomic codon distribution for a genome and its duplicatioin
0 20 40 60 800
2
4
6
8
Almost invariance of the genomic codon distribution in the duplication of a genome
Fig S3a4 Invariance of the genomic codon distribution after genome duplication. If a genome expands byduplicating itself, the genomic codon distribution does not change due to the normalisation. In the presentfigure, the distribution in the left subfigure and the distribution in the right subfigure will be same afternormalisation. In this sense, the genomic codon distribution is a conservative and intrinsic feature of aspecies when its genome evolves generally by large scale duplications (Fig S4a2). The trees of species in FigS3a5 and Fig S3d8 are obtained based on the genomic codon distributions of species with large variation ofgenome sizes, where the genome sizes themselves do not play a significant role also due to the normalisationin the definition of the genomic codon distribution.
67
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
001BCya518BCya522BCya715BCya717BCya
387B720BCya
742AE004BPc
005B306BPd498BPd335BPb343BPb418BPb478BPb026BPc619BPc391BPc166BPc628BPc078BPd
213BAct
215BAct
216BAct
217BAct
189BPc
603BPc
604BPc
611BPc
610BPc
617BPc
614BPc
606BPc
607BPc
608BPc
609BPc
616BPc
605BPc
612BPc
613BPc
615BPc
259BPc
262BPc
265BPc
266BPc
267BPc
279BPc
268BPc
270BPc
277BPc
271BPc
272BPc
273BPc
280BPc
274BPc
276BPc
275BPc
278BPc
281BPc
269BPc
638BPc
639BPc
641BPc
643BPc
642BPc
644BPc
263BPc
618BPc
649BPc
320BPc
340BPc
506B
791BPc
792BPc
793BPc
795BPc
794BPc
796BPc
797BPc
799BPc
801BPc
798BPc
800BPc
802BPc
640BPc
475BPc
476BPc
260BPc
345BPc
346BPc
185BCho
436BAct
787BPc
788BPc
789BPc
790BPc
336BCho
031BPc
531BPc
392BPc
505BPc
626BPc
620BPc
629BPc
630BPc
621BPc
622BPc
623BPc
624BPc
631BPc
634BPc
636BPc
625BPc
627BPc
632BPc
633BPc
635BPc
637BPc
764BPc
765BPc
771BPc
772BPc
768BPc
769BPc
770BPc
511BPb
512BPb
750BPc
349BF
358BF
350BF
351BF
549BPc
550BPc
551BPc
186BC
hl
220BPc
224BPc
222BPc
221BPc
223BPc
532BPc
598BPc
180BC
hl803B
pa156A
E303B
Pd
181BC
hl
182BC
hl
528BC
hl
183BC
hl
501BC
hl
529BC
hl407A
E410A
E324A
E017B
Pc
018BPc
019BPc
020BPc
389BPc
048BF
062BF
063BF
065BF
060BF
061BF
064BF
069BC
hl070B
Chl
071BC
hl072B
Chl
492B
Chl
713B
Pe
513B
Chl
514B
Chl
210B
F411A
E413A
E412A
E399A
E738A
E737A
E415A
E471B
Pe
163B
F240B
F241B
F474B
Pc
726B
242B
Pd
243B
F416B
724B
cya
231B
Chl
285B
Chl
752B
Spi
486B
F233B
Cho
234B
Cho
235B
Cho
038B
F282B
F300B
F301B
F036B
Pa
756B
Act
757B
Act
753B
Spi
754B
Spi
352B
F353B
F758A
E354B
F464B
Pb
465B
Pb
466B
Pb
469B
Pb
467B
Pb
468B
Pb
249B
Pc
258B
010B
Pc
012B
Pc
013B
Pc
011B
Pc
014B
Pc
016B
Pc
211B
Pc
530B
552B
Pc
379B
F380B
fir027B
Pc
766B
Pc
767B
Pc
313B
Pc
314B
Pc
315B
Pc
317B
Pc
316B
Pc
496B
Pc
359B
F360B
F497B
F361B
F318B
Pc
319B
Pc
363B
F364B
F365B
F675B
F677B
F676B
F679B
F700B
F701B
F702B
F681B
F684B
F686B
F682B
F685B
F683B
F687B
F698B
F689B
F695B
F692B
F697B
F694B
F691B
F690B
F696B
F693B
F688B
F699B
F703B
F705B
F704B
F680B
F678B
F073B
Pa
074B
Pa
076B
Pa
075B
Pa
367B
Pc
368B
Pc
369B
Pc
370B
Pc
743B
The
032B
Cya
482B
Cya
481B
Cya
426B
Cya
228B
Cya
229B
Cya
230B
Cya
755B
Cya
049B
F050B
F051B
F053B
F06
6BF
067B
F05
9BF
052B
F05
5BF
054B
F05
6BF
058B
F06
8BF
057B
F48
4B38
6BF
261B
F38
1BF
382B
F38
3BF
384B
F38
5BF
654B
F65
5BF
661B
F65
6BF
667B
F66
4BF
663B
F66
6BF
657B
F65
8BF
659B
F66
0BF
662B
F66
5BF
671B
F67
0BF
668B
F66
9BF
348B
F35
5BF
357B
F35
6BF
493B
Ver
029B
F03
0BF
208B
F31
1BChl
462B
F72
9BF
731B
F73
0BF
203B
F28
3BThe
503B
The
137B
015B
Pc
037B
Pa
470B
Pa
212B
F
167B
Ver
168B
Ver
169B
Ver
170B
Ver
171B
Ver
172BVer
173BVer
174BVer
175BVer
176BVer
178BVer
177BVer
516BCya
519BCya
525BCya
526BCya
372BSpi
373BSpi
374BSpi
375BSpi
376BSpi
377BSpi
759B
805E2
818E15
807E4
819E16
776BPa
284BF
286BChl
402AE
403AE
404AE
405AE
406AE
457BF
583BPa
587BPa
591BPa
592BPa
589BPa
588BPa
584BPa
585BPa
586BPa
672AC
709AC
710AC
711AC
751BPc
192BF
193BF
202BF
362BF
740BThe
741BThe
139BPe
140BPe
141BPe
142BPe
149BChl
148BChl
774BPa
326BPe
328BPe
329BPe
332BPe
330BPe
331BPe
333BPe
777B
327BPe
039B040AE
733AE
734AE
744BThe
745BThe
746BThe
397AE
155A248AC
396AC
557AE
559AE
558AE
138AC
507AE002BF287BPc
288BPc
294BPc
292BPc295BPc289BPc290BPc291BPc293BPc160BPc162BPc590BPa593BPa077B143BPe144BPe147BPe145BPe146BPe447BF337BAqu712BAqu400AE515BCya521BCya520BCya523BCya517BCya524BCya735B775BPa366BPd804E1806E3808E5810E7811E8814E11815E12817E14809E6812E9816E13813E10023AC
338AC553AC556AC554AC555AC739AC341AC736AC151BPc451BF452BF453BF194BF196BF197BF198BF201BF195BF209BF199BF200BF205BF206BF207BF204BF250B251B398AE477AC401AE449BF044BF488B158BF157BPa446BF459BF490BPa491BPa090BSpi094BSpi
091BSpi092BSpi
095BSpi097BSpi
003BPa308BPa
309BPa047BPa
778BPa388BPa
582BPa046BPb
749BPb098BPa
099BPa100BPa
575BPa576BPa
577BPa578BPa
580BPa579BPa
483BPa489B425BPc
252BPa645BPa
008BPb238BPb
187BPc086BPb
087BPb009BPb
089BPb510BPb
533BPc534BPc
536BPc535BPc
537BPc538BPc
540BPc545BPc
225BPb
560BPb
564BPb
045BPb
763BPb
673BPc
674BPc
247BPd
164BPa
165BPa
419BPa
423BPa
420BPa
421BPa
422BPa
028BPc
325BPc
494BPa
237BD
ei
297BA
ct
298BA
ct
601BA
ct
602BA
ct
429BA
ct
430BA
ct
433BA
ct
439BA
ct
445BA
ct
434BA
ct
435BA
ct
438BA
ct
573BA
ct
732BA
ct
460BPd
581BPa
714B
115BPb
119BPb
117BPb
118BPb
120BPb
116BPb
121BPb
135BPb
127BPb
126BPb
130BPb
131BPb
132BPb
133BPb
134BPb
122BPb
125BPb
124BPb
123BPb
378B
417BPb
653BPa
570BPa
572BPa
571BPa
651B
652BPa
296BA
ct
599BA
ct
706Bact
707Bact
708Bact
479Bact
006B
Aci
650B
Aci
307B
cya
085B
Pb
232B
Pb
546B
Pc
547B
Pc
548B
Pc
539B
Pc
541B
Pc
543B
Pc
542B
Pc
544B
Pc
509B
Pb
574B
Pb
128B
Pb
129B
Pb
136B
Pb
561B
Pb
562B
Pb
563B
Pb
188B
Pc
779B
Pc
781B
Pc
780B
Pc
782B
Pc
783B
Pc
784B
Pc
785B
Pc
786B
Pc
080B
Act
082B
Act
083B
Act
084B
Act
081B
Act
245B
Pd
246B
Pd
024B
Pa
395B
Pa
566B
Pa
567B
Pa
568B
Pa
569B
Pa
647B
Pa
648B
Pa
495B
Pa
472B
Pa
473B
Pa
487B
Pa
339B
Pa
394B
Pa
424B
Pa
264B
Pa
342B
Pa
596B
Pa
646B
Pa
390B
Pa
079B
101B
pa
102B
pa
105B
pa
103B
pa
107B
Pa
108B
Pa
106B
pa
104B
pa
485B
Pa
244B
Pd
312B
310B
Pa
594B
Cho
595B
Cho
021B
PC
022B
PC
153B
Pc
302B
Pd
304B
Pd
305B
Pd
499B
Pd
179B
Chl
184B
Chl
500B
Chl
725B
Pd
154B
F025B
239B
Pd
427B
F502B
F334B
F
727B
Pd
042B
Act
043B
Act
214B
Act
218B
Act
219B
Act
428B
Act
431B
Act
432B
Act
440B
Act
441B
Act
442B
Act
443B
Act
437B
Act
527B
Act
226B
cya
227B
cya
716B
Cya
722B
Cya
723B
Cya
721B
Cya
565B
Act
718B
Cya
719B
Cya
236B
Dei
321A
E
463A
E
408A
E
600B
Chl
409A
E
371B
act
007B
Act
444B
Act
322A
E
323A
E
088B
Pb
508B
033B
Pd
035B
Pd
034B
Pd
190B
Act
191B
Act
347B
Act
504B
Pa
597B
Act
480B
Act
344B
Act
929V
dD
909V
dD
461A
041B
Pe
299B
414A
E
393B
F
760B
F
761B
F
762B
F
109B
Pc
110B
Pc
113B
Pc
114B
Pc
150B
Pc
112B
Pc
253B
Pa
254B
Pa
255B
Pa
256B
Pa
257B
Pa
908V
dD
873V
dD
931V
dD
093B
Spi
096B
Spi
450B
F
925V
dD
927V
UP
866V
dD
895V
dD
896V
dD
950V
dD
951V
dD
728B
Pb
901V
dD
773B
Pc
747B
Dei
748B
Dei
159B
448B
F
454B
F
458B
F
456B
F
827V
dD
919V
dD
875V
dD
823V
dD
882V
dD
884V
dD
824V
dD
821V
dD
825V
dD
915V
dD
917V
dD
916V
dD11
1BPc
887V
dD
889VdD
891VdD
890VdD
888VdD
822VdD
826VdD
834VU
P
902VdD881VdD
921VdD920VdD
918VdD455BF
161BChl
871VdD867VdD
870VdD868VdD
869VdD906VdD
913VdD900VdD
922VdD938VdD
936VsR928VdD
940VdD946VdD
941VdD874VdD
907VdD883VdD937VsR935VsR
914VdD152BPc
910VdD872VUP
894VRT945VsR944VsR
877VUV924VdD933VsR932VsR926VdD942VsD943VsD
847VdRm947VsR892VsR854VdRm
849VdRm857VdRm
836VdRm856VdRm
845VdRm864VdRm838VdRm832VsDm905VsDo876VsD885VsRb930VsD899VsD830VsDc893VsD903VsD923VsD911VsDcl833VsD835VUC904VsDo878VsRe828VdDB912VsDcl829VdDB840VdRm846VdRm848VdRm863VdRm839VdRm862VdRm837VdRm853VdRm860VdRm831VsD952VSat934VSat949VdRr843VdRm850VdRm859VdRm844VdRm861VdRm948VdRr886VsRb855VdRm
851VdRm841VdRm858VdRm879VsRe
852VdRm842VdRm865VdRm880VsRe939VUC897VUC898VUC820VUC
Fig S3a5 Tree of species according to the average correlation coefficients among genomic codon distributionsby averaging for the 64 codons. This tree is reasonable to some extent. Considering the also reasonable treeof codons in Fig 3A that is obtained based on the same data, it is inferred that an intrinsic relationship existsbetween the tree of life and the evolution of the genetic code. Please enlarge to see details. Refer to the legendof Fig 3D in the appendix to see the abbreviations of the taxa.
68
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
BeeT
halecressN
emat
ode
Beetle
Cerevisiae
Cow
Dog
Zebrafish
Horse O
poss
um
Mou
se
Rat
Hum
an
Chimpanzee
M. Mulatta
Chicken
Fig S3a6 Tree of eukaryotes based on the average correlation coefficients among genomic codondistributions. This tree is reasonable to some extent. Note that most eukaryotic genomes are non-codingDNA. So this reasonable tree is mostly based on the non-coding DNA, which indicates that the non-codingDNA, rather than the coding DNA, plays a principal role in the phylogeny of eukaryotes.
69
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
0 9 18 27 36 45 54 63 72 81 90 990
0.01
0.02
0.03
0.04
0.05
Virus
0 9 18 27 36 45 54 63 72 81 90 990
0.01
0.02
0.03
0.04
0.05
Bacteria
0 9 18 27 36 45 54 63 72 81 90 990
0.01
0.02
0.03
0.04
0.05
Archaea
0 9 18 27 36 45 54 63 72 81 90 990
0.01
0.02
0.03
0.04
0.05
Eukaryote
Fig 3B Features of genomic codon distributions for domains based on complete genome sequences. Referto Fig S3b1 to see the genomic codon distributions of Crenarchaeota and Euryarchaeota; refer to Fig S3b2 tosee the genomic codon distributions for species; and refer to Fig S3b3 to see the simulations of the genomiccodon distributions.
70
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
0 9 18 27 36 45 54 63 72 81 90 990
0.01
0.02
0.03
0.04
0.05
Crenarchaeota
0 9 18 27 36 45 54 63 72 81 90 990
0.01
0.02
0.03
0.04
0.05
Euryarchaeota
Fig S3b1 Genomic codon distributions for Crenarchaeota and Euryarchaeota based on genomic data. Thefeatures for both Crenarchaeota and Euryarchaeota agree with that for Archaea in Fig 3B, while they arequite different from the features for Eukarya in Fig 3B. So the relationship between Crenarchaeota andEuryarchaeota is closer than with Eukarya, which tends to agree with the three-domain tree rather than theeocyte tree.
71
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
0 9 18 27 36 45 54 63 72 81 90 990
0.02
0.04
0.06
0.08
0.1
Virus: Yersinia phage Berlin
0 9 18 27 36 45 54 63 72 81 90 990
0.01
0.02
0.03
0.04
0.05
Bacteria: Zymomonas mobilis ZM4
0 9 18 27 36 45 54 63 72 81 90 990
0.01
0.02
0.03
0.04
0.05
Archaea: Methanococcus vannielii SB
0 9 18 27 36 45 54 63 72 81 90 990
0.01
0.02
0.03
0.04
0.05
Eukaryote: Homo sapiens
Fig S3b2 Features of genomic codon distributions of species in different domains based on genomic data.The features for species in the present figures agree with the features for the corresponding domains in Fig3B respectively. For example, the features for eukaryotes are quite different from the features for the otherdomains, either for any species in the present figure or for the domain in Fig 3B.
72
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
0 9 18 27 36 45 54 63 72 81 90 990
0.02
0.04
0.06
0.08
simulation for virus
0 9 18 27 36 45 54 63 72 81 90 990
0.02
0.04
0.06
0.08
simulation for bacteria
0 9 18 27 36 45 54 63 72 81 90 990
0.01
0.02
0.03
0.04
simulation for archaea
0 9 18 27 36 45 54 63 72 81 90 990
0.01
0.02
0.03
0.04
simulation for eukaryote
Fig S3b3 Simulations of features of genomic codon distributions for the different domains. This simulationresults for the domains agree with the observations based on genomic data in Fig 2B, respectively. Thesimulations also reveal the mechanism that results in such features.
73
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
#1 #2 #3 #4 #5 #6 #7 #8 #9 #10#11#12#13#14#15#16#17#18#19#20#21#22#23#24#25#26#27#28#29#30#31#32
2
4
6
8
10
12
14
16
18
20
22
24
26
28
30
32
34
36
38
40
42
44
46
48
50
52
54
56
58
60
62
64
Recruitment order of the 32 codon pairs
Heig
ht ord
er
of th
e 6
4 g
enom
ic c
odon d
istr
ibutions for
dom
ain
s
Virus
Bacteria
Crenarchaeota
Euryarchaeota
Archaea
Eukaryote
Fig 3C On the origin order of the domains. The recruitment order of the 32 codon pairs indicates anevolutionary direction, which reveals the origin order of the domains.
74
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
64 CGA #13 H2 GCC #02 H1 AAU #30 H4 AAU #30 H4 AAU #30 H4 AUU #30 H4
63 GGC #02 H1 GGC #02 H1 AUU #30 H4 AUU #30 H4 AUU #30 H4 AAU #30 H4
62 CCG #10 H1 CGC #07 H1 GAA #23 H3 GAA #23 H3 GAA #23 H3 UUU #32 H4
61 CGG #10 H1 GCG #07 H1 UUC #23 H3 UUC #23 H3 UUC #23 H3 AAA #32 H4
60 UCG #13 H2 CCG #10 H1 AUA #29 H4 AUA #29 H4 AUA #29 H4 UUA #31 H4
59 CGC #07 H1 CGG #10 H1 UAU #29 H4 UAU #29 H4 UAU #29 H4 UAA #31 H4
58 GCG #07 H1 GCA #09 H2 AAG #27 H3 AAG #27 H3 AAG #27 H3 CUU #27 H3
57 GCC #02 H1 UGC #09 H2 CUU #27 H3 CUU #27 H3 CUU #27 H3 AAG #27 H3
56 GAC #05 H2 UCG #13 H2 UAA #31 H4 UAA #31 H4 UAA #31 H4 UCU #14 H3
55 UGG #12 H2 CGA #13 H2 UUA #31 H4 UUA #31 H4 UUA #31 H4 AGA #14 H3
54 CAA #28 H3 AUU #30 H4 GGA #03 H2 GGA #03 H2 GGA #03 H2 UUC #23 H3
53 CUG #20 H2 AAU #30 H4 UCC #03 H2 UCC #03 H2 UCC #03 H2 GAA #23 H3
52 ACG #16 H2 AUC #21 H3 AAA #32 H4 AUC #21 H3 AAA #32 H4 UGA #15 H3
51 GAA #23 H3 GAU #21 H3 UUU #32 H4 GAU #21 H3 UUU #32 H4 UCA #15 H3
50 GCA #09 H2 CAG #20 H2 UCA #15 H3 UCA #15 H3 UCA #15 H3 CUG #20 H2
49 ACC #06 H2 CUG #20 H2 UGA #15 H3 AAA #32 H4 UGA #15 H3 CAG #20 H2
48 AAG #27 H3 AGC #08 H2 AUC #21 H3 UGA #15 H3 AUC #21 H3 UUG #28 H3
47 CAG #20 H2 UUC #23 H3 GAU #21 H3 UUU #32 H4 GAU #21 H3 CAA #28 H3
46 GAU #21 H3 GCU #08 H2 AGG #11 H2 AGA #14 H3 AGA #14 H3 UAU #29 H4
45 GCU #08 H2 GAA #23 H3 AGA #14 H3 AGG #11 H2 UCU #14 H3 AUA #29 H4
44 UGA #15 H3 CCA #12 H2 CCU #11 H2 UCU #14 H3 AGG #11 H2 AUG #22 H3
43 AUC #21 H3 UGG #12 H2 UCU #14 H3 CCU #11 H2 CCU #11 H2 CAU #22 H3
42 GGU #06 H2 UUG #28 H3 CAA #28 H3 CAA #28 H3 CAA #28 H3 UGU #18 H3
41 GUC #05 H2 CAA #28 H3 GAG #04 H2 UUG #28 H3 UUG #28 H3 ACA #18 H3
40 UCA #15 H3 UCA #15 H3 UUG #28 H3 GAG #04 H2 GAG #04 H2 UGG #12 H2
39 CCA #12 H2 UGA #15 H3 CUC #04 H2 CUC #04 H2 CUC #04 H2 CCA #12 H2
38 GGA #03 H2 CUU #27 H3 CAU #22 H3 CAU #22 H3 CAU #22 H3 CCU #11 H2
37 AAC #26 H3 AAG #27 H3 AUG #22 H3 AUG #22 H3 AUG #22 H3 AGG #11 H2
36 AGC #08 H2 CAU #22 H3 GGC #02 H1 CAG #20 H2 CAG #20 H2 AGU #17 H3
35 CGU #16 H2 AUG #22 H3 GCC #02 H1 CUG #20 H2 CUG #20 H2 ACU #17 H3
34 GAG #04 H2 UUU #32 H4 CAG #20 H2 GGC #02 H1 GGC #02 H1 UCC #03 H2
33 AUG #22 H3 AAA #32 H4 CUG #20 H2 GCC #02 H1 GCC #02 H1 GGA #03 H2
32 UUC #23 H3 ACC #06 H2 CCA #12 H2 CCA #12 H2 AAC #26 H3 GUU #26 H3
31 UGC #09 H2 GGU #06 H2 AAC #26 H3 AAC #26 H3 GUU #26 H3 AAC #26 H3
30 GUG #19 H2 AAC #26 H3 UGG #12 H2 UGG #12 H2 CCA #12 H2 CUC #04 H2
29 ACA #18 H3 GUU #26 H3 GUU #26 H3 GUU #26 H3 UGG #12 H2 GAG #04 H2
28 CAC #19 H2 UUA #31 H4 AGC #08 H2 AGC #08 H2 AGC #08 H2 UGC #09 H2
27 AGG #11 H2 UAA #31 H4 GCU #08 H2 GCU #08 H2 GCU #08 H2 GCA #09 H2
26 AGA #14 H3 CGU #16 H2 CGG #10 H1 CGG #10 H1 CGG #10 H1 GAU #21 H3
25 CAU #22 H3 ACG #16 H2 CCG #10 H1 CCG #10 H1 CCG #10 H1 AUC #21 H3
24 GUU #26 H3 UCC #03 H2 UCG #13 H2 UCG #13 H2 UCG #13 H2 GCU #08 H2
23 UUG #28 H3 GGA #03 H2 CGA #13 H2 CGA #13 H2 CGA #13 H2 AGC #08 H2
22 AAU #30 H4 GUC #05 H2 ACC #06 H2 GCA #09 H2 GCA #09 H2 UAG #25 H3
21 UCC #03 H2 GAC #05 H2 GCA #09 H2 UGC #09 H2 ACC #06 H2 CUA #25 H3
20 AUU #30 H4 CCU #11 H2 GGU #06 H2 ACC #06 H2 UGC #09 H2 GUG #19 H2
19 CUC #04 H2 AGG #11 H2 UGC #09 H2 GGU #06 H2 GGU #06 H2 CAC #19 H2
18 CCU #11 H2 UAU #29 H4 AGU #17 H3 AGU #17 H3 AGU #17 H3 GUA #24 H3
17 UGU #18 H3 AUA #29 H4 ACU #17 H3 ACU #17 H3 ACU #17 H3 UAC #24 H3
16 AAA #32 H4 CAC #19 H2 ACA #18 H3 ACA #18 H3 ACA #18 H3 GGU #06 H2
15 CUU #27 H3 GUG #19 H2 UGU #18 H3 UGU #18 H3 UGU #18 H3 ACC #06 H2
14 ACU #17 H3 UCU #14 H3 UAC #24 H3 UAC #24 H3 UAC #24 H3 GCC #02 H1
13 UAC #24 H3 AGA #14 H3 GUA #24 H3 GUA #24 H3 GUA #24 H3 GGC #02 H1
12 UCU #14 H3 CUC #04 H2 GAC #05 H2 GAC #05 H2 GAC #05 H2 GUC #05 H2
11 AGU #17 H3 GAG #04 H2 GUC #05 H2 GUC #05 H2 GUC #05 H2 GAC #05 H2
10 GGG #01 H1 ACA #18 H3 ACG #16 H2 CGU #16 H2 CGU #16 H2 CCC #01 H1
9 CCC #01 H1 UGU #18 H3 CGU #16 H2 ACG #16 H2 ACG #16 H2 GGG #01 H1
8 UUU #32 H4 CCC #01 H1 CGC #07 H1 CGC #07 H1 CGC #07 H1 CGU #16 H2
7 UAU #29 H4 GGG #01 H1 GCG #07 H1 GCG #07 H1 GCG #07 H1 ACG #16 H2
6 GUA #24 H3 ACU #17 H3 UAG #25 H3 UAG #25 H3 UAG #25 H3 CCG #10 H1
5 UAA #31 H4 AGU #17 H3 CUA #25 H3 CUA #25 H3 CUA #25 H3 CGG #10 H1
4 UUA #31 H4 UAC #24 H3 CAC #19 H2 CAC #19 H2 CAC #19 H2 UCG #13 H2
3 AUA #29 H4 GUA #24 H3 GUG #19 H2 GUG #19 H2 GUG #19 H2 CGA #13 H2
2 CUA #25 H3 CUA #25 H3 GGG #01 H1 GGG #01 H1 GGG #01 H1 CGC #07 H1
1 UAG #25 H3 UAG #25 H3 CCC #01 H1 CCC #01 H1 CCC #01 H1 GCG #07 H1
Height order of the 64 genomic codon distributions for domains
Order Virus Bacteria Crenarchaeota Euryarchaeota Archaea Eukaryote
H1: Hierarchy 1 H2: Hierarchy 2 H3: Hierarchy 3 H4: Hierarchy 4
Fig S3c1 The orders of the average distribution heights for different domains, where the values of genomiccodon distributions here are obtained based on genomic data in Fig 3B and Fig S3b1. The orders of theaverage distribution heights for Archaea and Eukarya are more similar with respect to Hierarchy 1 − 4 thanthat for Virus and Bacteria.
75
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
64 GAG #04 H2 GGC #02 H1 UUA #31 H4 AUU #30 H4
63 GCG #07 H1 GCC #02 H1 UAA #31 H4 UUA #31 H4
62 CGG #10 H1 GCG #07 H1 AUU #30 H4 UAU #29 H4
61 GGC #02 H1 CCG #10 H1 AAU #30 H4 UUG #28 H3
60 AGG #11 H2 CGA #13 H2 AUA #29 H4 GUU #26 H3
59 CGA #13 H2 UGC #09 H2 UCA #15 H3 UAA #31 H4
58 GCC #02 H1 CGG #10 H1 CAU #22 H3 AAU #30 H4
57 CGC #07 H1 CAG #20 H2 AAC #26 H3 UGU #18 H3
56 GGG #01 H1 CCU #11 H2 AAG #27 H3 CUU #27 H3
55 CCG #10 H1 ACC #06 H2 GAA #23 H3 UUC #23 H3
54 ACG #16 H2 CGC #07 H1 UAU #29 H4 UCU #14 H3
53 AGC #08 H2 AGC #08 H2 CUU #27 H3 AUA #29 H4
52 GGA #03 H2 UGG #12 H2 AUC #21 H3 UUU #32 H4
51 GAC #05 H2 AGG #11 H2 AGU #17 H3 AGU #17 H3
50 CAG #20 H2 GCA #09 H2 UAG #25 H3 UGA #15 H3
49 GCA #09 H2 CGU #16 H2 UAC #24 H3 CUA #25 H3
48 AGA #14 H3 GAC #05 H2 CAA #28 H3 AUC #21 H3
47 GGU #06 H2 CUG #20 H2 GUU #26 H3 UAG #25 H3
46 GUG #19 H2 GGG #01 H1 UUC #23 H3 GAU #21 H3
45 AAG #27 H3 GCU #08 H2 ACU #17 H3 ACU #17 H3
44 CCC #01 H1 CAC #19 H2 GUA #24 H3 GUA #24 H3
43 GAA #23 H3 UCG #13 H2 UGA #15 H3 UCA #15 H3
42 GUC #05 H2 GUG #19 H2 UUG #28 H3 AUG #22 H3
41 CCA #12 H2 GGU #06 H2 AGA #14 H3 UAC #24 H3
40 CAC #19 H2 ACG #16 H2 GAU #21 H3 CAU #22 H3
39 ACC #06 H2 CCC #01 H1 ACA #18 H3 GAA #23 H3
38 UCG #13 H2 CCA #12 H2 UCU #14 H3 AAG #27 H3
37 CGU #16 H2 GUC #05 H2 CUA #25 H3 AGA #14 H3
36 UGC #09 H2 GGA #03 H2 AUG #22 H3 AAA #32 H4
35 UGG #12 H2 CUC #04 H2 AAA #32 H4 AAC #26 H3
34 AAC #26 H3 GAG #04 H2 CGA #13 H2 CAA #28 H3
33 CAA #28 H3 AGU #17 H3 ACG #16 H2 ACA #18 H3
32 GCU #08 H2 UAG #25 H3 UGU #18 H3 CUG #20 H2
31 ACA #18 H3 GAA #23 H3 UUU #32 H4 GUC #05 H2
30 GUA #24 H3 AAG #27 H3 CAG #20 H2 CCU #11 H2
29 CUG #20 H2 UUG #28 H3 GCA #09 H2 CGU #16 H2
28 AGU #17 H3 UCA #15 H3 CCA #12 H2 GGU #06 H2
27 UAG #25 H3 CAA #28 H3 GCU #08 H2 UCC #03 H2
26 CCU #11 H2 GUA #24 H3 AGG #11 H2 UGG #12 H2
25 UGA #15 H3 GUU #26 H3 UGC #09 H2 UGC #09 H2
24 UCC #03 H2 ACU #17 H3 AGC #08 H2 UCG #13 H2
23 AUG #22 H3 UCC #03 H2 ACC #06 H2 GCU #08 H2
22 CUC #04 H2 CUA #25 H3 CUC #04 H2 GUG #19 H2
21 GAU #21 H3 AGA #14 H3 CCU #11 H2 CUC #04 H2
20 CUA #25 H3 AAC #26 H3 CUG #20 H2 CCA #12 H2
19 UAC #24 H3 UAC #24 H3 CAC #19 H2 ACC #06 H2
18 UCA #15 H3 ACA #18 H3 GGU #06 H2 CAC #19 H2
17 AAA #32 H4 CUU #27 H3 UCG #13 H2 GGA #03 H2
16 CAU #22 H3 AUG #22 H3 CGU #16 H2 AGC #08 H2
15 AUC #21 H3 GAU #21 H3 GAC #05 H2 CGA #13 H2
14 ACU #17 H3 UUC #23 H3 UCC #03 H2 AGG #11 H2
13 UGU #18 H3 UGA #15 H3 GAG #04 H2 ACG #16 H2
12 UUG #28 H3 UGU #18 H3 GUC #05 H2 CAG #20 H2
11 UAA #31 H4 AAU #30 H4 GUG #19 H2 GAC #05 H2
10 GUU #26 H3 AUA #29 H4 UGG #12 H2 GCA #09 H2
9 AAU #30 H4 CAU #22 H3 GGA #03 H2 GAG #04 H2
8 UCU #14 H3 UAU #29 H4 GCC #02 H1 GCC #02 H1
7 AUA #29 H4 AUC #21 H3 GGC #02 H1 CCG #10 H1
6 CUU #27 H3 AUU #30 H4 CGC #07 H1 CGC #07 H1
5 UAU #29 H4 UUA #31 H4 CCG #10 H1 CGG #10 H1
4 UUC #23 H3 UAA #31 H4 GCG #07 H1 GGC #02 H1
3 UUA #31 H4 UCU #14 H3 CGG #10 H1 GCG #07 H1
2 AUU #30 H4 UUU #32 H4 CCC #01 H1 GGG #01 H1
1 UUU #32 H4 AAA #32 H4 GGG #01 H1 CCC #01 H1
Simulation of height order of the 64 genomic codon distributions for domains
Order simulaton for virus simulaton for bacteria simulaton for archaea simulaton for eukaryote
H1: Hierarchy 1 H2: Hierarchy 2 H3: Hierarchy 3 H4: Hierarchy 4
Fig S3c2 Simulations of the orders of the average distribution heights for different domains, where the valuesof genomic codon distributions here are obtained from the simulation results in Fig S3b3. These simulationresults for the domains agree with the observations in Fig S3c1 based on genomic data, respectively.
76
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
−0.2−0.15
−0.1−0.05
00.05
0
1
2
3
Flu
ctu
ation
−0.06−0.04
Route bias
−0.020
0.02Hierarchy bias
E
AE
AC
BPa
BPb
BPc
BPd
BPe
BF
BAct
BCya
BChl
BSpi
BVer
BThe
Bother
VdD
VsR
VdRm
VsD
VdDB
Vsat
Fig 3D Classification of Virus (yellow), Bacteria (blue), Archaea (red) and Eukarya (green) in theevolution space, which is based on the roadmap and the complete genome sequences. Refer to FigS3d3, S3d4, S3d5 to see the two dimensional projections of this figure. The abbreviations for thetaxa are as follows: Eukarya (E), Euryarchaeota (AE), Crenarchaeota (AC), Alphaproteobacteria (BPa),Betaproteobacteria (BPb), Gammaproteobacteria (BPc), Deltaproteobacteria (BPd), Epsilonproteobacteria(BPe), Firmicutes (BF), Actinobacteria (BAct), Cyanobacteria (BCya), Chlorobi (BChl), Spirochaetes(BSpi), Chlamydiae/Verrucomicrobia (BVer), Thermotogae (BThe), other bacteria (Bother), dsDNA virus(VdD), ssRNA virus (VsR), Mammalian orthoreovirus (VdRm), ssDNA virus (VsD), Blainvillea yellow spotvirus (VdDB), satellite virus (Vsat).
77
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
20 combinations 64 permutations
Typ
es o
f gen
om
ic c
od
on
dis
tribu
tion
s
<G>
<C>
<A>
<T>
Ro
ute
0R
ou
te 1
~3
Ro
ute
0R
ou
te 1
~3
Ro
ute
0R
ou
te 1
~3
Ro
ute
0R
ou
te 1
~3
R s
tran
dY
stra
nd
R s
tran
dY
stra
nd
Hie
rarc
hy 1
~2
Hie
rarc
hy 3
~4
<G,G,G>
<G,G,A>
<G,G,C>
<G,G,T>
<G,C,A>
<C,C,C>
<C,C,T>
<C,C,G>
<C,C,A>
<G,C,T>
<A,A,A>
<A,A,G>
<A,A,T>
<A,A,C>
<A,T,G>
<T,T,T>
<T,T,C>
<T,T,A>
<T,T,G>
<A,T,C>
t1
t3, t10
t2
t12
t4, t9, t14
t7
t6
t13
t5, t11
t20
t19
t16
t17
t8, t15, t18
GGG
GGA
GGC
GGT
AGC
GAC
GAG
GCG
GTG
GCA
ACG
AGG
CGG
TGG
CAG
CGA
CCC
TCC
GCC
ACC
GCT
GTC
CTC
CGC
CAC
TGC
CGT
CCT
CCG
CCA
CTG
TCG
AAA
AAG
AAT
AAC
GAT
AGT
AGA
ATA
ACA
ATG
GTA
GAA
TAA
CAA
TGA
TAG
TTT
CTT
ATT
GTT
ATC
ACT
TCT
TAT
TGT
CAT
TAC
TTC
TTA
TTG
TCA
CTA
Fig S3d1 The 20 types of genomic codon distributions. There are respectively 20 combinations and 64permutations of the 4 bases. There is a coarse correlation between the 20 combinations of codons and the 20amino acids according to the tRNAs t1 to t20 (Fig S2a1).
78
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
#11
G
G
A
C
C
T
#14
A
G
A
T
C
T
#1
G
G
G
C
C
C
#3
A
G
G
T
C
C
#27
G
A
A
C
T
T
#32
A
A
A
T
T
T
#4
G
A
G
C
T
C
#23
A
A
G
T
T
C
#5
C
A
G
G
T
C
#26
C
A
A
G
T
T
#2
C
G
G
G
C
C
#8
C
G
A
G
C
T
#21
T
A
G
A
T
C
#30
T
A
A
A
T
T
#6
T
G
G
A
C
C
#17
T
G
A
A
C
T
#16
G
C
A
C
G
T
#18
A
C
A
T
G
T
#7
G
C
G
C
G
C
#9
A
C
G
T
G
C
#22
G
T
A
C
A
T
#29
A
T
A
T
A
T
#19
G
T
G
C
A
C
#24
A
T
G
T
A
C
#20
G
A
C
C
T
G
#28
A
A
C
T
T
G
#10
G
G
C
C
C
G
#13
A
G
C
T
C
G
#25
G
A
T
C
T
A
#31
A
A
T
T
T
A
#12
G
G
T
C
C
A
#15
A
G
T
T
C
A
Hierarchy 1 Hierarchy 2 Hierarchy 3 Hierarchy 4
Ro
ute
0
Ro
ute
1
Ro
ute
2
Ro
ute
3
Ro
ute
bia
sfo
r gen
om
ic c
od
on
dis
tribu
tion
s
Hierarchy bias
for genomic codon distributions
Fig S3d2 Explanation of the definitions of hierarchy bias and route bias. The hierarchy bias is definedby the difference between the averages of genomic codon distributions for Hierarchy 1 − 2 and that forHierarchy 3 − 4. And the route bias is defined by the difference between the averages of genomic codondistributions for Route 0 and that for Route 1 − 3. By the way, the leaf node bias is defined by the differencebetween the averages of genomic codon distributions for the leaf nodes in Route 0 − 1 and that for the leafnodes in Route 2 − 3.
79
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
−0.2 −0.15 −0.1 −0.05 0 0.05−0.08
−0.06
−0.04
−0.02
0
0.02
Hierarchy bias
Route
bia
s
E
AE
AC
BPa
BPb
BPc
BPd
BPe
BF
BAct
BCya
BChl
BSpi
BVer
BThe
Bother
VdD
VsR
VdRm
VsD
VdDB
Vsat
Fig S3d3 A two dimensional projection of the three dimensional evolution space (Fig 3D): the Hierarchybias and Route bias plane. Bacteria (blue) and Archaea (red) are distinguished by and large in the presentfigure.
80
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
−0.2 −0.15 −0.1 −0.05 0 0.05
0
1
2
3
Hierarchy bias
Flu
ctu
ation
E
AE
AC
BPa
BPb
BPc
BPd
BPe
BF
BAct
BCya
BChl
BSpi
BVer
BThe
Bother
VdD
VsR
VdRm
VsD
VdDB
Vsat
Fig S3d4 A two dimensional projection of the three dimensional evolution space (Fig 3D): the Hierarchy biasand Genomic codon distribution Fluctuation plane. Eukarya (green), Bacteria/Archaea, and Virus (yellow)are distinguished by and large in the present figure.
81
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
−0.08−0.06−0.04−0.0200.02
0
1
2
3
Route bias
Flu
ctu
ation
E
AE
AC
BPa
BPb
BPc
BPd
BPe
BF
BAct
BCya
BChl
BSpi
BVer
BThe
Bother
VdD
VsR
VdRm
VsD
VdDB
Vsat
Fig S3d5 A two dimensional projection of the three dimensional evolution space (Fig 3D): the Route biasand Fluctuation plane. Eukarya (green), Bacteria (blue), Archaea (red), and Virus (yellow) are distinguishedby and large in the present figure.
82
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
−0.06 −0.04 −0.02 0
−0.08
−0.06
−0.04
−0.02
0
0.02
Leaf node bias
Route
bia
s
E
AE
AC
BPa
BPb
BPc
BPd
BPe
BF
BAct
BCya
BChl
BSpi
BVer
BThe
Bother
VdD
VsR
VdRm
VsD
VdDB
Vsat
Fig S3d6 Distinguish Crenarchaeota (AC) and Euryarchaeota (AE) in the Leaf node bias and Route biasplane.
83
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
−0.06 −0.04 −0.02 0
0
1
2
3
Leaf node bias
Flu
ctu
ation
E
AE
AC
BPa
BPb
BPc
BPd
BPe
BF
BAct
BCya
BChl
BSpi
BVer
BThe
Bother
VdD
VsR
VdRm
VsD
VdDB
Vsat
Fig S3d7 Distinguish Crenarchaeota (AC) and Euryarchaeota (AE) in the Leaf node bias and Fluctuationplane.
84
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
001BCyanoB060BF
426BCyanoB069Z
070BChloro071BChloro
032BCyanoB482Z
481BCyanoB637BPc
220Z222BPc224BPc221BPc
223Z048BF
513BChloro
514BChloro062BF
063Z
182BChloro036BPa
242BPd
407AEuryA
235BChlofl
038BF
756BActinB
757BActinB
233BChlofl
234BChlofl
713BPe
700BF724Z078Z
720BCyanoB140Z
282BF803Z324Z
181BChloro
528BChloro
300BF
183BChloro
501BChloro
301BF
474BPc
518Z
522BCyanoB
715BCyanoB
061Z
065Z
064BF
072BChloro
243BF
212BF
230BCyanoB
229BCyanoB
163BF
411Z
415AEuryA
726Z
240BF
241BF
492BChloro
367BPc
369BPc
368BPc
370Z
818E15
258BOther
010BPc
014BPc
530Z
011BPc
012BPc
013BPc
496Z
027BPc
015BPc
314Z
316BPc
315BPc
317BPc
313BPc
361Z
211BPc
142BPe
356BF
261BF
319BPc
380Z
766BPc
767BPc
160BPc
231BChloro
486BF
379BF
002BF
446BF
459BF
760BF
761Z
762BOther
037BPa
285BChloro
386BF
049BF
051Z
050Z
059Z
052BF
053BF
067Z
066BF
055BF
068BF
054Z
058BF
056BF
172BV
erruc057B
F
228BCyanoB
583BPa
589BPa
587Z
592BPa
591BPa
588BPa
470BPa
681BF
684BF
682BF
685BF
683BF
686BF
471BPe
759BO
ther
137BO
ther516Z
462BF
519BC
yanoB493Z
525BC
yanoB
526BC
yanoB202B
F
283BT
hermo
167Z168Z
170BV
erruc
169BV
erruc
171BV
erruc
175BV
erruc
177BV
erruc
178BV
erruc
176BV
erruc752B
Spir
173Z
174BV
erruc584B
Pa
585BPa
484Z776B
Pa
208BF
376Z377Z
311BC
hloro
743BT
hermo
372Z373Z
755BC
yanoB
399AE
uryA044B
F362B
F357B
F403A
Eury
A404A
Eury
A402A
Eury
A405A
Eury
A586B
Pa
110B
Pc
114Z
113B
Pc
143Z
144B
Pe
145B
Pe
158B
F451B
F452B
F453B
F146B
Pe
147B
Pe
366B
Pd
488Z
284B
F590Z
593Z
148B
Chlo
ro503B
Therm
o149B
Chlo
ro775B
Pa
774Z
327Z
337B
Aquifi
406A
Eury
A712B
Aquifi
073B
Pa
328B
Pe
333B
Pe
329B
Pe
332B
Pe
330B
Pe
331B
Pe
141Z
363Z
326B
Pe
703B
F704B
F705B
F364B
F365B
F318B
Pc
381Z
382Z
383B
F384B
F359Z
360Z
497B
F687B
F698B
F694B
F693B
F697B
F689B
F699B
F6
96
BF
68
8B
F6
90
BF
69
5B
F6
92
BF
69
1B
F385B
F680Z
074B
Pa
075B
Pa
076B
Pa
701B
F702B
F678B
F679Z
139B
Pe
457Z
077Z
162B
Pc
447B
F449Z
151B
Pc
675Z
676B
F677B
F355B
F286B
Chlo
ro654Z
659B
F660B
F661B
F666B
F667B
F658B
F655Z
662B
F657B
F664B
F656B
F663B
F665B
F671Z
348B
F751Z
287Z
293Z
670Z
288Z
295Z
292B
Pc
294Z
289B
Pc
290Z
291B
Pc
668B
F669B
F029B
F030B
F738Z
203B
F210B
F413Z
374B
Spir
375B
Spir
412Z
138A
Cre
nA
730Z
507A
Eury
A729B
F73
1BF
248A
Cre
nA
737Z
416B
Oth
er
397Z
744Z
745B
The
rmo
746B
The
rmo
039Z
155A
Oth
erA
396A
Cre
nA
557Z
559Z
558Z
041B
Pe
450Z
157B
Pa
109Z
159B
Oth
er
414Z
398A
Eur
yA
477A
Cre
nA
112Z
150B
Pc
393B
F44
8Z25
3BPa
254B
Pa
255B
Pa25
6Z25
7Z49
0BPa
491B
Pa09
0BSp
ir
094B
Spir
095B
Spir
092B
Spir
097B
Spir
091Z
111B
Pc
204B
F29
9Z46
1Z09
3BSp
ir
096B
Spir
773Z
192Z
194Z
196B
F19
7BF
195B
F19
8BF
201B
F
205Z
207B
F
400A
EuryA
735B
Oth
er
206B
F
209B
F
199B
F
200B
F
740B
Therm
o
741BTherm
o
515BCyanoB
521BCyanoB
524BCyanoB
517Z
520BCyanoB
523BCyanoB
193BF
672ACrenA
250BOther
251BOther
401Z
709ACrenA
710Z
711Z
152BPc
161BChloro
454BF
458Z
455Z
456Z
023Z
554ACrenA
409Z
555ACrenA
739ACrenA
736ACrenA
410AEuryA
553Z
556ACrenA
777Z
040Z
733Z
734AEuryA
341ACrenA
804E1
808E5
809E6
805E2
819E16
807E4
806E3
812E9
816E13
813E10
817E14
814E11
815E12
810E7
811E8
003BPa
043BActinB
420BPa
421BPa
489BOther
483BPa
653BPa
264BPa
425BPc
582BPa
645BPa
046Z
578BPa
580BPa
577BPa
575BPa
576BPa
098BPa
100BPa
308Z309BPa
099Z245BPd
246BPd
390BPa
395Z487BPa
008BPb
045BPb
122BPb
123BPb
124BPb
125BPb
579BPa
219BActinB
534BPc
535BPc
536BPc749BPb533Z214BActinB
238BPb009BPb540BPc560BPb086Z087Z564Z028BPc325BPc088Z225BPb187Z247Z388BPa296BActinB297BActinB298BActinB601BActinB602BActinB573BActinB378Z429BActinB430Z433BActinB445BActinB434BActinB435BActinB438BActinB439BActinB673BPc674BPc021Z080BActinB084BActinB082Z083BActinB081BActinB218BActinB085BPb232BPb128BPb129BPb136BPb307Z
321AEuryA509BPb510BPb539BPc543BPc544Z541BPc542BPc089Z538BPc130BPb132BPb134BPb131BPb133BPb463Z126Z127Z545BPc561BPb763BPb562BPb563BPb115BPb116BPb117BPb118BPb121Z119BPb120BPb135BPb188BPc537BPc428Z431Z432BActinB442BActinB
441BActinB440BActinB
443BActinB
437BActinB
784BPc785BPc
786BPc779BPc
780BPc781BPc
782BPc783Z444BActinB
322AEuryA
323Z004BPc005BOther
244Z391BPc596BPa
722Z716BCyanoB
485BPa312Z472BPa
527BActinB
721BCyanoB
594BChlofl
595BChlofl
022Z179BChloro
184BChloro
464BPb465BPb
466BPb467BPb
469BPb468BPb
215Z216Z217BActinB
303BPd
742Z320BPc
418BPb
498Z352Z353BF006BA
cidoB
101Z102BPa
106Z105Z
108BPa
103BPa
104Z107B
Pa
153Z
499BPd
334BF
758AEuryA
024Z
650BA
cidoB
568BPa
342BPa
042BA
ctinB
310BPa
473BPa
723BC
yanoB
646BPa
154BF
226Z
500BC
hloro
339BPa
566BPa
567BPa
569BPa
424BPa
025BO
ther
502BF
156AEuryA
529Z
239BPd
427BF
227Z
306Z
753Z
754BSpir
338Z
727BPd
079Z
478BPb
302BPd
304BPd
305Z
408AE
uryA
394BPa
647BPa
725BPd
495BPa
648Z
007BA
ctinB
047BPa
236BD
einoc
460BPd
581BPa
651Z
652BP
a
714Z
164Z
165B
Pa
252B
Pa
494B
Pa
237Z
778B
Pa
600B
Chlo
ro
371Z
422B
Pa
732B
Actin
B
419B
Pa
190Z
191B
Actin
B
347B
Actin
B
707Z
708Z
480B
Actin
B
417B
Pb
479Z
423B
Pa
599B
Actin
B
706Z
504B
Pa
570Z
571B
Pa
572B
Pa
597B
Actin
B
016B
Pc
392B
Pc
505B
Pc
770B
Pc
552B
Pc
768B
Pc
031B
Pc
627B
Pc
632B
Pc
625B
Pc
634B
Pc
636B
Pc
531B
Pc
633B
Pc
769Z
511B
Pb
512Z
750B
Pc
186B
Chlo
ro
340B
Pc
598B
Pc
620B
Pc
630B
Pc
629B
Pc
764Z
765B
Pc
349B
F350Z
351B
F621B
Pc
622B
Pc
623B
Pc
624B
Pc
631Z
771B
Pc
772B
Pc
791Z
79
2B
Pc
79
7B
Pc
79
3B
Pc
79
4B
Pc
79
5B
Pc
79
6B
Pc
798B
Pc
800B
Pc
799B
Pc
801B
Pc
802B
Pc
017B
Pc
018B
Pc
019B
Pc
389B
Pc
532B
Pc
551B
Pc
020B
Pc
358Z
626B
Pc
549Z
550B
Pc
475Z
476Z
506Z
635B
Pc
026B
Pc
166B
Pc
619B
Pc
213Z
387B
Oth
er
180B
Chlo
ro
354B
F
628B
Pc
189B
Pc
259B
Pc
263Z
335B
Pb
343B
Pb
649B
Pc
262Z
265Z
267B
Pc
268B
Pc
266B
Pc
279B
Pc
270B
Pc
281B
Pc
269Z
271B
Pc
273Z
272Z
280Z
278B
Pc
277B
Pc
274Z
275B
Pc
276B
Pc
641Z
642B
Pc
643B
Pc
603Z
604Z
614B
Pc
607B
Pc
617Z
609B
Pc
611B
Pc
605Z
606B
Pc
612B
Pc
608B
Pc
613B
Pc
615B
Pc
616Z
610B
Pc
639B
Pc
644B
Pc
638B
Pc
618B
Pc
640Z
787Z
788B
Pc
789B
Pc
790B
Pc
336B
Chl
ofl
260B
Pc
345B
Pc
346Z546Z548Z547Z
574B
Pb
185B
Chl
ofl
565B
Act
inB
717B
Cya
noB
718B
Cya
noB
719B
Cya
noB
249B
Pc
436Z508Z
873V
dD
909V
dD
931V
dD
728B
Pb
875V
dD
919V
dD
825V
dD
866V
dD
895V
dD
896V
dD
950V
dD
951V
dD
827V
dD
821V
dD
823V
dD
822V
dD
824V
dD
826V
dD
901V
dD
929V
dD
925V
dD
927V
UP
834V
UP
913V
dD
936V
sR
928V
dD
940V
dD
935V
sR
937V
sR
902V
dD
941VdD
946VdD
908VdD
033BPd
035BPd
344BAct
inB
034BPd
887VdD889VdD
890VdD891VdD
888VdD881VdD
918VdD920VdD
921VdD900VdD
907VdD882VdD
884VdD915VdD
917VdD916VdD
867VdD870VdD
869VdD871VdD
868VdD906VdD
938VdD883VdD
874VdD914VdD
922VdD
747BDeinoc
748BDeinoc820VUC
880VsRe
897VUC855VdRm
832VsDm
876VsD905VsD
o939VUC933VsR
828VdDB830VsDc
863VdRm947VsR833VsD948VdRr837VdRm949VdRr
904VsDo911VsDcl930VsD835VUC952VSat893VsD899VsD846VdRm848VdRm894VRT924VdD945VsR903VsD923VsD926VdD934VSat829VdDB845VdRm864VdRm912VsDcl836VdRm856VdRm857VdRm847VdRm839VdRm862VdRm838VdRm842VdRm851VdRm859VdRm843VdRm849VdRm942VsD943VsD944VsR840VdRm854VdRm853VdRm860VdRm852VdRm861VdRm841VdRm858VdRm850VdRm
885VsRb844VdRm865VdRm886VsRb831VsD932VsR
878VsRe879VsRe898VUC892VsR872VUP910VdD877VUV
Fig S3d8 Tree of species based on the distances among species in the nondimentionalized evolution spaceFig 3D. This tree agrees with the three-domain tree by and large. Please enlarge to see details. Refer to thelegend of Fig 3D in the appendix to see the abbreviations of the taxa.
85
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
1. Virus2. Bacteria
4. Eukarya
3. Crenarchaeota
3. Euryarchaeota
Fig S3d9 Tree of Bacteria, Crenarchaeota, Euryarchaeota, Eukarya and Virus based on the average distancesamong these taxa in the nondimentionalized evolution space Fig 3D. The origin order of these taxa is obtainedbased on the result in Fig 3C. This tree agrees with the three-domain tree rather than the eocyte tree.
86
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
Bee
Dog
Zebrafish
Thalecress
Beetle
Nematode
Cow
Human
Chimpanzee
M. Mulatta
Rat
Opossum
Mouse
Horse
Chicken
Cerevisiae
Fig S3d10 Tree of eukaryotes based on their distances in the nondimentionalized evolution space Fig 3D.This tree is considerably reasonable.
87
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
600 550 500 450 400 350 300 250 200 150 100 50 0
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
Logarith
m o
f num
ber
of genera
Cm O S D C P Tr J K Pg NEdiacaran
Time (Ma)
Trend of genome size evolution: Ngenome
(t) = exp( − t ⋅ log2 / + 21.5 )
Trend of biodiversity evolution: Ngenus
(t) = exp( − t ⋅ log2 / + 8 )
177.8
172.7
≈
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
Origin
al lo
garith
mic
mean o
f genom
e s
ize G
orilo
gM
ean
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
Fig 4A Explanation of the trend of the Phanerozoic biodiversity curve by the trend of genome size evolution.The exponential growth rate of the Phanerozoic biodiversity curve is about equal to the exponential growthrate of genome size evolution (indicated in red). Refer to Fig S4a7 and Fig S4a8 to see the exponential trendof the genome size evolution and the exponential trend of the Phanerozoic biodiversity, respectively.
88
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
0
10
20
30
40
50
60
70
80
genome size (base pairs)
Nu
mb
er
of
sp
ecie
s
107
108
109
1010
1011
GlogMean
Database←
Angiosperm
Gymnosperm
Pteridophyte
Bryophyte
Deuterostomia
Protostomia
Diploblostica
Fig S4a1 Logarithmic normal distribution of genome sizes for taxa. The genome size distribution for allthe species in the two databases satisfies logarithmic normal distribution very well, due to the additivity ofnormal distributions. The logarithmic mean of genome sizes for all the species in the two databases is markedin the present figure, which represents the intercept of the trend of genome size evolution in Fig S4a7.
89
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
time t
logarithm of genome size
Simulation of genome size evolution: Taxon A
tA
origin
tA
GA
orilogMeanG
A
logMean ± G
A
logSD
Ancestor ofTaxon A Taxon A
Simulation of genome size evolution: Taxon B
tB
origin < t
A
origin
tB
GB
orilogMeanG
B
logMean ± G
B
logSD
GA
orilogMeanG
A
logMean G
A
logSD
< = >
Ancestor ofTaxon B Taxon B
Fig S4a2 Simulation of the evolution of genome size. In the simulation, the genome size statistically doublesin the stochastic process model, which results in the logarithmic genome size distributions. The invarianceof the genomic codon distributions after genome duplication is considered in the model (Fig S3a4). Twotaxa A and B are compared in the simulation. The greater the logarithmic standard deviation of genome sizesin a taxon is (GB
logS D > GAlogS D), the earlier the taxon originated (tB
origin <tAorigin). Besides, the less the original
logarithmic mean of genome size is (GBorilogMean < GA
orilogMean), the earlier the taxon originated (tBorigin <tA
origin).
90
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6
16
17
18
19
20
21
22
23
24
Lo
ga
rith
mic
me
an
of
ge
no
me
siz
e G
logM
ean
Logarithmic standard deviation of genome size GlogSD
GlogMean
for taxa
GorilogMean
for taxa
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6
16
17
18
19
20
21
22
23
24
Orig
ina
l lo
ga
rith
mic
me
an
of
ge
no
me
siz
e G
orilo
gM
ean
Fig S4a3 The greater the logarithmic standard deviation of genome sizes GlogS D is, the less the originallogarithmic mean of genome sizes GorilogMean is, and meanwhile also the less the logarithmic mean of genomesizes GlogMean is. These statistical analyses of genome sizes for the taxa are obtained based on the genomicdata in the two databases for animals and plants.
91
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
13 14 15 16 17 18 19 20 21 22 23 24 25 26 2713
14
15
16
17
18
19
20
21
22
23
24
25
26
27
Archeae
Bacteria
Algae&Protists
Diploblostica
Protostomia
Deuterostomia
Bryophyte
Pteridophyte
Gymnosperm
Angiosperm
GlogMean
± χ * GlogSD
for taxa
GlogMean
± χ * GlogSD
for lower taxa
minimum through maximum
predicted upper limit of logarithmic genome size
Original logarithmic mean of genome size GorilogMean
Lo
ga
rith
mic
va
ria
tio
n o
f g
en
om
e s
ize
Fig S4a4 The less the original logarithmic mean of genome sizes GorilogMean is, the less the logarithmic meanof genome sizes GlogMean is (red), and meanwhile the greater the logarithmic standard deviation of genomesizes GlogS D is (green or blue). The original logarithmic mean of genome sizes GorilogMean indicate the origintime of taxon (Fig S4a2). The convergent point (namely the upper vertex of the triangle area in the presentfigure) indicates the upper limit of logarithmic genome sizes at present.
92
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
No. Superphylum Taxon number of records GlogMean
GlogSD
GorilogMean
1 Protostomia Nematodes 66 −2.394 0.93204 −3.8552
2 Deuterostomia Chordates 5 −1.8885 0.91958 −3.3301
3 Diploblostica Sponges 7 −1.0834 1.3675 −3.2272
4 Diploblostica Ctenophores 2 −0.010305 1.6417 −2.584
5 Protostomia Tardigrades 21 −1.2168 0.7276 −2.3574
6 Protostomia Misc_Invertebrate 57 −0.75852 0.96321 −2.2686
7 Protostomia Arthropod 1284 −0.078413 1.2116 −1.9778
8 Protostomia Annelid 140 −0.14875 0.9258 −1.6001
9 Protostomia Myriapods 15 −0.54874 0.66478 −1.5909
10 Protostomia Flatworms 68 0.15556 1.0701 −1.522
11 Protostomia Rotifers 9 −0.51158 0.55134 −1.3759
12 Diploblostica Cnidarians 11 −0.16888 0.69379 −1.2565
13 Deuterostomia Fish 2045 0.23067 0.6559 −0.7976
14 Deuterostomia Echinoderm 48 0.11223 0.52794 −0.71542
15 Protostomia Molluscs 263 0.5812 0.5493 −0.27994
16 Deuterostomia Bird 474 0.32019 0.13788 0.10403
17 Deuterostomia Reptile 418 0.78332 0.28332 0.33916
18 Deuterostomia Amphibian 927 2.4116 1.081 0.71691
19 Deuterostomia Mammal 657 1.1837 0.2401 0.80727
Fig S4a5 The three animal superphyla Diploblostica, Protostomia and Deuterostomia have been roughlydistinguished based on the order of the original logarithmic means of genome sizes GorilogMean for the animaltaxa (Fig S4a4). This supports that the order of GorilogMean indicate the evolutionary chronology.
93
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
No. Class Taxon GlogMean
GlogSD
GorilogMean
1 Dicotyledoneae Lentibulariaceae −1.0532 0.88349 −2.43822 Monocotyledoneae Cyperaceae −0.81211 0.61307 −1.77323 Dicotyledoneae Cruciferae −0.62192 0.6855 −1.69664 Dicotyledoneae Rutaceae −0.22121 0.93413 −1.68565 Dicotyledoneae Oxalidaceae 0.19774 1.1445 −1.59646 Dicotyledoneae Crassulaceae −0.26578 0.82268 −1.55557 Dicotyledoneae Rosaceae −0.40468 0.62511 −1.38478 Dicotyledoneae Boraginaceae −0.20664 0.68081 −1.27399 Dicotyledoneae Labiatae −0.0021905 0.80883 −1.270210 Monocotyledoneae Juncaceae −0.23032 0.63698 −1.228911 Dicotyledoneae Vitaceae −0.60987 0.39049 −1.22212 Dicotyledoneae Cucurbitaceae −0.26487 0.60779 −1.217713 Dicotyledoneae Onagraceae 0.040848 0.78018 −1.182214 Dicotyledoneae Leguminosae 0.33968 0.88684 −1.050615 Dicotyledoneae Myrtaceae −0.37801 0.42511 −1.044516 Monocotyledoneae Bromeliaceae −0.56838 0.29232 −1.026617 Dicotyledoneae Polygonaceae 0.20985 0.76174 −0.9843318 Dicotyledoneae Euphorbiaceae 0.72687 1.0796 −0.9656119 Dicotyledoneae Convolvulaceae 0.50052 0.928 −0.954320 Dicotyledoneae Chenopodiaceae −0.046809 0.5526 −0.9131221 Dicotyledoneae Plantaginaceae −0.15021 0.48422 −0.9093222 Dicotyledoneae Rubiaceae −0.084413 0.51565 −0.892823 Dicotyledoneae Caryophyllaceae 0.27683 0.65869 −0.755824 Dicotyledoneae Amaranthaceae 0.15834 0.58176 −0.7536925 Dicotyledoneae Malvaceae 0.39517 0.47109 −0.3433626 Monocotyledoneae Zingiberaceae 0.24819 0.36317 −0.3211527 Monocotyledoneae Iridaceae 1.3429 1.0491 −0.3017828 Dicotyledoneae Umbelliferae 0.71003 0.6235 −0.2674229 Dicotyledoneae Solanaceae 0.78034 0.66585 −0.2635230 Monocotyledoneae Orchidaceae 1.4063 1.0551 −0.2478431 Monocotyledoneae Araceae 1.5174 1.012 −0.06915232 Dicotyledoneae Papaveraceae 0.93206 0.61932 −0.03885433 Dicotyledoneae Compositae 1.0741 0.70726 −0.03465734 Monocotyledoneae Gramineae 1.4002 0.84476 0.07589435 Dicotyledoneae Cactaceae 0.98813 0.57251 0.09060836 Monocotyledoneae Palmae 1.1222 0.63488 0.1269137 Dicotyledoneae Passifloraceae 0.52209 0.22472 0.1697938 Dicotyledoneae Orobanchaceae 1.12 0.54393 0.2672639 Dicotyledoneae Cistaceae 0.88905 0.30458 0.4115540 Monocotyledoneae Asparagaceae 2.0053 0.78802 0.7699141 Dicotyledoneae Asteraceae 1.8795 0.67031 0.8286342 Dicotyledoneae Ranunculaceae 2.0285 0.72517 0.891643 Monocotyledoneae Agavaceae 1.6207 0.4537 0.9094144 Monocotyledoneae Hyacinthaceae 2.3635 0.69028 1.281445 Dicotyledoneae Loranthaceae 2.3797 0.68478 1.306246 Monocotyledoneae Commelinaceae 2.5322 0.64196 1.525847 Monocotyledoneae Amaryllidaceae 2.9085 0.5811 1.997548 Monocotyledoneae Xanthorrhoeaceae 2.7036 0.4 2.076549 Monocotyledoneae Asphodelaceae 2.8054 0.33968 2.272950 Monocotyledoneae Alliaceae 2.9051 0.39078 2.292451 Dicotyledoneae Paeoniaceae 2.957 0.28164 2.515552 Monocotyledoneae Liliaceae 3.5678 0.63278 2.575753 Monocotyledoneae Aloaceae 2.9724 0.2203 2.627
Fig S4a6 The two angiosperm classes Dicotyledoneae and Monocotyledoneae have been roughlydistinguished based on the order of the original logarithmic means of genome sizes GorilogMean for theangiosperm taxa (Fig S4a4). This also supports that the order of GorilogMean indicate the evolutionarychronology.
94
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
600 550 500 450 400 350 300 250 200 150 100 50 0
19
20
21
22
23
Lo
ga
rith
mic
me
an
of
ge
no
me
siz
e G
logM
ean
Diploblostica
Protostomia
Deuterostomia
Bryophyte
Pteridophyte
Gymnosperm
Angiosperm
Trend of GlogMean
Trend of GorilogMean
GorilogMean
= GlogMean
− χ ⋅ GlogSD
GlogMean
Database→such that
Cm O S D C P Tr J K Pg NEdiacaran
Time (Ma)
19
20
21
22
23
Orig
ina
l lo
ga
rith
mic
me
an
of
ge
no
me
siz
e G
orilo
gM
ean
19
20
21
22
23
19
20
21
22
23
19
20
21
22
23
19
20
21
22
23
19
20
21
22
23
19
20
21
22
23
Fig S4a7 The exponential trend of genome size evolution based on the relation between the origin time oftaxa and their original logarithmic means of genome sizes. The red regression line of GlogMean against theorigin time should be shifted downwards to the blue regression line of GorilogMean against the origin time,by considering the inverse relationship between GlogS D and the origin time (Fig 4a2). The parameter χ inthe definition of GorilogMean is determined by letting the intercept of the blue regression line be the averagelogarithmic genome size GDatabase
logMean (Fig 4a1).
95
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
500 450 400 350 300 250 200 150 100 50 0
−2
−1
0
1
2
3
4
5
6
7
8
9
Time (Ma)
Lo
ga
rith
m o
f n
um
be
r o
f g
en
era
Cm O S D C P Tr J K Pg N
Phanerozoic biodiversity curve = Fluctuation of biodiversity × Trend of biodiversity
Fluctuation of biodiversity
Trend of biodiversity: Ngenus
(t) = exp( − t ⋅ log2 / 172.7 + 8 )
500 450 400 350 300 250 200 150 100 50 0
1
10
100
1000
10000
Nu
mb
er
of
ge
ne
ra
Fig S4a8 The exponential trend of Bambach et al.’s Phanerozoic biodiversity curve. The unnormalisedbiodiversity net fluctuation curve (thin light green) is obtained by subtracting the linear trend of logarithmicbiodiversity curve (thin dark green) from the logarithmic biodiversity curve (thick dark green).
96
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
500 450 400 350 300 250 200 150 100 50 0
−3
−2
−1
0
1
2
3
Time (Ma)
Rela
tive v
ariation
Cm O S D C P Tr J K Pg N
O−
S
F−
F
P−
Tr
Tr−
J
K−
Pg
G−
L
Biodiversity
Climato−eustatic curve
Fig 4B Explanation of the fluctuations of the Phanerozoic biodiversity curve based on climate change andsea level fluctuations, which indicates the tectonic cause of the mass extinctions. Refer to Fig S4b4 and FigS4a8 to see the climato-eustatic curve and the biodiversity net fluctuation curve, respectively.
97
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
500 450 400 350 300 250 200 150 100 50 0
Cm O S D C P Tr J K Pg N
Time (Ma)
Clim
atic g
rad
ien
t b
ase
d o
n c
lima
te in
dic
ato
rsLow
Mod
Mod H
iH
iV
ery
Hi
Warm
Cool w
inte
rsF
reezin
g w
inte
rsU
nip
ola
r gla
cia
tion
Bip
ola
r gla
cia
tion
Te
mp
era
ture
Fig S4b1 Climatic gradient curve in the Phanerozoic eon based on climate indicators. The climatic gradientcurve is opposite to the climate curve.
98
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
500 450 400 350 300 250 200 150 100 50 0
−4
−3
−2
−1
0
1
2
3
4
Time (Ma)
Re
lative
te
mp
era
ture
average Climate change
Cm O S D C P Tr J K Pg N4
3
2
1
0
−1
−2
−3
−4
Re
lativ
e c
lima
te g
rad
ien
t
Climate indicators
CO2
87Sr/
86Sr
δ18
O
Fig S4b2 The Phanerozoic climate curve is obtained by averaging the four independent results (climateindicators, Berner’s CO2, marine 87S r/86S r and marine δ18O).
99
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
500 450 400 350 300 250 200 150 100 50 0
−4
−3
−2
−1
0
1
2
3
4
Time (Ma)
Re
lative
se
a le
ve
l
average Sea level
Cm O S D C P Tr J K Pg N
Hallam
Haq
Fig S4b3 The Phanerozoic eustatic curve is obtained by averaging the Hallam’s and Haq’s results on sealevel fluctuations.
100
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
500 450 400 350 300 250 200 150 100 50 0
−4
−3
−2
−1
0
1
2
3
4
Time (Ma)
Re
lative
va
ria
tio
n
average Climato−eustatic curve
Cm O S D C P Tr J K Pg N
Sea level
Climate gradient
Fig S4b4 The climato-eustatic curve is obtained by averaging the Phanerozoic climate gradient curve (FigS4b2) and the Phanerozoic eustatic curve (Fig S4b3).
101
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
500 450 400 350 300 250 200 150 100 50 0
−4
−3
−2
−1
0
1
2
3
4
Time (Ma)
Re
lative
va
ria
tio
n
Cm O S D C P Tr J K Pg N
Sea level
Climate gradient
Biodiversity
Fig S4b5 Relationships among the eustatic curve (Fig S4b3), the climate gradient curve (Fig S4b2) and thebiodiversity net fluctuation curve (Fig S4a8) in the Phanerozoic eon.
102
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
500 450 400 350 300 250 200 150 100 50 0
−0.4
−0.2
0
0.2
0.4
Time (Ma)
Va
ria
tio
n r
ate
s
Cm O S D C P Tr J K Pg N
Sea level variation rate
Climate gradient variation rate
Biodiversity variation rate
Fig S4b6 Variation rate of the eustatic curve, variation rate of the climate gradient curve and variation rateof the biodiversity net fluctuation curve in the Phanerozoic eon (the derivative curves of the three curves inFig S4b5).
103
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
500 450 400 350 300 250 200 150 100 50 0
−0.4
−0.2
0
0.2
0.4
Time (Ma)
Va
ria
tio
n r
ate
s
Cm O S D C P Tr J K Pg N
O−
S
F−
F
P−
Tr
Tr−
J
K−
Pg
G−
L
Climato−eustatic variation rate
origination rate − extinction rate
Biodiversity variation rate
Fig S4b7 Variation rate of the climato-eustatic curve and variation rate of the biodiversity net fluctuationcurve in the Phanerozoic eon (the derivative curves of the two curves in Fig 4B). The five mass extinctions(including two phases in the P-Tr extinction) are marked in the present figure (Fig 4B), and the minorextinctions can also be observed in the variation rate of the climato-eustatic curve. Additionally, the differencebetween the origination rate and the extinction rate is plotted (Fig S4d3).
104
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
500 450 400 350 300 250 200 150 100 50 0
0
1000
2000
3000
4000
5000
6000
Time (Ma)
Num
ber
of genera
Cm O S D C P Tr J K Pg N
Predicted biodiversity curve
Phanerozoic biodiversity curve
All genera in Sepkoski’s Compendium
Well−resolved genera in Sepkoski’s Compendium
Fig 4C Reconstruction of the Phanerozoic biodiversity curve based on genomic, climatic and eustatic data,whose result agrees with the biodiversity curve based on fossil records. The Bambach et al.’s Phanerozoicbiodiversity curve (Fig S4a8) based on fossil records is in green, and the predicted biodiversity curve in pinkis based on the climato-eustatic curve in Fig S4b4 and the exponential trend of genome size evolution in FigS4a7. Refer to Fig S4c1 for details.
105
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
500 450 400 350 300 250 200 150 100 50 0
−2
−1
0
1
2
3
4
5
6
7
8
9
10
Time (Ma)
Lo
ga
rith
m o
f n
um
be
r o
f g
en
era
Cm O S D C P Tr J K Pg N
Phaneroazoic biodiversity curve = Fluctuation of biodiversity × Trend of biodiversity
Fluctuation of biodiversity
Trend of biodiversity: Ngenus
(t) = exp( − t ⋅ log2 / 172.7 + 8 )
Predicted biodiversity curve = Predicted fluctuation of biodiversity × Predicted trend of biodiversity
Predicted fluctuation of biodiversity = Climato−eustatic curve × standard deviation of Fluctuation of biodiversity
Predicted trend of biodiversity: Ngenus
predict(t) = exp( − t ⋅ log2 / 177.8 + 8 )
Climate−eustatic curve
500 450 400 350 300 250 200 150 100 50 0
1
10
100
1000
10000
Nu
mb
er
of
ge
ne
raFig S4c1 Detailed explanation of the reconstruction of the Phanerozoic biodiversity curve based on genomic,climatic and eustatic data (Fig 4A, 4B, 4C). The exponential growth trend of the predicted biodiversity curve(obtained based on the exponential trend in genome size evolution in Fig S4a7) corresponds to the exponentialgrowth trend in the Phanerozoic biodiversity curve (obtained based on the fossil records in Fig S4a8); Thefluctuations of the predicted biodiversity curve (obtained based on the climato-eustatic curve in Fig S4b4)corresponds to the biodiversity net fluctuation curve (obtained based on the fossil records in Fig S4a8).
106
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
500 450 400 350 300 250 200 150 100 50 0
0
0.2
0.4
0.6
0.8
1
Time (Ma)
Origin
ation r
ate
Cm O S D C P Tr J K Pg N
Predicted origination rate
Phanerozoic origination rate
Fig 4D Explanation of the declining origination rate through the Phanerozoic. Refer to Fig S4d3 to seethe declining origination rate and extinction rate based on fossil records; and refer to Fig S4d2 to see theexplanation of these declining rates.
107
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
500 450 400 350 300 250 200 150 100 50 0
0
100
200
300
Time (Ma)
Orig
ina
tio
n o
f g
en
era
Cm O S D C P Tr J K Pg N
Predicted origination rate × Predicted trend of biodiversity
Constraint of origination ability
Fig S4d1 Constraint of origination ability of life on Earth, which results in the declining origination ratethrough the Phanerozoic eon (Fig 4D).
108
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
500 450 400 350 300 250 200 150 100 50 0
−0.1
0
0.1
0.2
0.3
0.4
0.5
0.6
Time (Ma)
Pre
dic
ted
va
ria
tio
n r
ate
Cm O S D C P Tr J K Pg N
Predicted origination rate = c’ (Climato−eustatic variation rate + c’’) / Predicted trend of biodiversity
Predicted extinction rate
Predicted origination rate − Predicted extinction rate
Fig S4d2 Explanation of the declining origination rate and extinction rate in the Phanerozoic eon, accordingto the reconstruction of the Phanerozoic biodiversity curve (Fig S4c1). The declining rates are owing to theexponential growth trend driven by the genome size evolution.
109
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint
500 450 400 350 300 250 200 150 100 50 0
−0.2
0
0.2
0.4
0.6
0.8
Time (Ma)
Va
ria
tio
n r
ate
Cm O S D C P Tr J K Pg N
Phanerozoic origination rate
Phanerozoic extinction rate
Phanerozoic origination rate − Phanerozoic extinction rate
Fig S4d3 The declining origination rate and extinction rate in the Phanerozoic eon based on fossil records,which can be explained in Fig 4D and Fig S4d2.
110
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 13, 2015. . https://doi.org/10.1101/017988doi: bioRxiv preprint