Current Biology, Volume 24 Supplemental Information Phylogenomics Resolves a Spider Backbone Phylogeny and Rejects a Prevailing Paradigm for Orb Web Evolution Jason E. Bond, Nicole L. Garrison, Chris A. Hamilton, Rebecca L. Godwin, Marshal Hedin, Ingi Agnarsson




Fig. S1, related to Figure 2. Maximum likelihood (ML) tree topology (-985426.78), based on supermatrix analysis of 128 putative orthologs, showing relative support values for Bayesian inference (BI; -967007.27 and -966991.37 for runs 0 and 1 respectively), ML, and parsimony (PA; 156,587 steps) analyses. Filled blocks denote BI/ML/PA bootstrap (bs) values = 100% and posterior probabilities (pp) = 1.0; otherwise exact values are indicated at each node (pp/ML-bs/PA-bs). Results of gene-tree/species-tree analyses (MP-EST, STAR, NJst) for this data matrix are presented on the following pages; numbers on each tree indicate bootstrap support values for the corresponding nodes of the estimated species tree. Tree lengths in these figures do not represent branch lengths but rather the number of times that each group appeared in the input bootstrap trees.


Fig. S2, related to Figure 3. Ancestral state reconstruction of A. Genitalic condition; B. Orb web presence/absence; C. Web type; D. Setae type; E. Aggregate glands presence/absence; F. Paracymbium presence/absence. Phylogenetic tree showing the maximum likelihood ancestral state reconstruction for each character across the major spider lineages. Pie charts at nodes denote the relative likelihood that an ancestor had a particular state (legend inset).


Table S3, related to Figures 1 & 2. Approximately Unbiased (AU) test results for alternative hypotheses of spider relationships

Alternative Hypothesis                              Ln Likelihood Score (difference)   P-Value   Significantly different
Best tree                                           -1865172.01                        -         -
Orbiculariae + Oecobiidae                           -1865404.36 (232.3)                2e-05     Yes
Mygalomorph relationships sensu Bond et al. 2012    -1865664.69 (492.7)                3e-04     Yes
Orbiculariae monophyly                              -1865958.88 (786.9)                7e-51     Yes
Araneomorph relationships sensu Coddington 2005     -1869416.11 (4244.1)               8e-07     Yes
Mygalomorph relationships sensu Raven 1985          -1891608.58 (26436.6)              1e-43     Yes

AU tests performed in CONSEL using matrix of 327 genes corresponding to the phylogeny (best tree) depicted in Figure 2. Per site log likelihoods were re-estimated for each tree in RAxML (-f G option) to produce the individual matrices for AU test statistic calculations.
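
The parenthesized differences in Table S3 are the gaps between the best-tree log likelihood and each alternative. The following is a minimal sketch that recomputes them from the values in the table; the function and variable names are illustrative and are not part of the original analysis, which used RAxML and CONSEL.

```python
# Log likelihoods and reported differences copied from Table S3.
BEST_TREE_LNL = -1865172.01

ALTERNATIVES = {
    "Orbiculariae + Oecobiidae": (-1865404.36, 232.3),
    "Mygalomorph relationships sensu Bond et al. 2012": (-1865664.69, 492.7),
    "Orbiculariae monophyly": (-1865958.88, 786.9),
    "Araneomorph relationships sensu Coddington 2005": (-1869416.11, 4244.1),
    "Mygalomorph relationships sensu Raven 1985": (-1891608.58, 26436.6),
}


def lnl_differences(best_lnl, alternatives):
    """Return best-tree lnL minus each alternative's lnL (hypothetical helper)."""
    return {name: best_lnl - lnl for name, (lnl, _reported) in alternatives.items()}


if __name__ == "__main__":
    for name, diff in lnl_differences(BEST_TREE_LNL, ALTERNATIVES).items():
        print(f"{name}: {diff:.2f} (reported {ALTERNATIVES[name][1]})")
```

The differences alone do not give the P-values; those come from the multiscale bootstrap of per-site log likelihoods that CONSEL performs.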


Table S4, related to Figure 4. Chronogram Calibrations and Divergence Time Estimates

Node Number                                Minimum/Maximum Age   Estimated Age of Node (mya)
1 - Mesothelae                             299/400               387 (253-520)
2                                          -                     384 (252-516)
3 - Extant Mygalomorphs                    242/386               327 (215-440)
4                                          -                     189 (124-254)
5                                          -                     149 (98-201)
6                                          -                     138 (90-185)
7 - Nemesiids                              125/299               128 (84-172)
8                                          -                     92 (60-124)
9                                          -                     117 (76-157)
10                                         -                     109 (71-146)
11                                         -                     66 (43-90)
12                                         -                     6 (3-8)
13                                         -                     84 (53-114)
14                                         -                     111 (71-150)
15                                         -                     53 (34-71)
16                                         -                     162 (106-218)
17 - Antrodiaetids                         113/299               144 (94-193)
18                                         -                     60 (39-81)
19 - Extant Araneomorphs                   228/386               344 (226-462)
20                                         -                     201 (132-270)
21                                         -                     187 (123-251)
22 - Cribellate Orb Web                    161/299               173 (114-232)
23                                         -                     161 (106-217)
24                                         -                     137 (90-185)
25                                         -                     114 (75-154)
26                                         -                     91 (60-123)
27                                         -                     90 (59-121)
28                                         -                     78 (51-105)
29                                         -                     57 (37-76)
30                                         -                     155 (102-208)
31                                         -                     134 (88-180)
32 - araneids + Nesticidae/Linyphiidae     125/299               125 (82-167)
33                                         -                     84 (55-113)
34                                         -                     107 (70-144)
35                                         -                     99 (64-133)
36                                         -                     92 (60-124)
37 - Haplogynae + Hypochilidae             173/299               299 (196-401)
38                                         -                     229 (150-308)
39                                         -                     256 (167-344)


Supplemental Experimental Procedures

Overview. RNA was extracted from 39 field-collected animals and used to create cDNA libraries for transcriptome sequencing. Transcriptomes were assembled and processed using a bioinformatics pipeline that screened for data quality and used the HaMStR [S1] approach to select orthologous genes for phylogenetic inference. Concatenated and single gene matrices were analyzed using maximum likelihood implemented in RAxML [S2]; gene trees were evaluated using three partially parametric species-tree approaches. The partitioned concatenated matrix was also evaluated using TNT [S3] and ExaBayes (http://sco.h-its.org/exelixis/web/software/exabayes/index.html). Divergence times were estimated using RelTime [S4], alternative hypotheses were evaluated using Approximately Unbiased (AU) tests, and ancestral character states were reconstructed using maximum likelihood. All supporting data matrices, tree files, and analysis information files have been deposited in the Dryad Digital Repository at doi: XXXXXXXX.

Taxon Sampling, RNA Extraction, and Sequencing. Exemplar taxa were selected to represent the total breadth of spider phylogenetic diversity. Outgroups included the non-spider arachnids Hesperochernes sp. (Pseudoscorpiones) and Ixodes scapularis (Acari), and the crustacean Daphnia pulex, the organism most closely allied with Arachnida for which there was a reference proteome in the InParanoid database. For most OTUs (for exceptions see Supplementary Table 3), total RNA was extracted from flash-frozen or RNAlater® (Invitrogen Life Technologies) preserved tissues. Specimens were initially collected and transferred, still living, to the lab, where they were flash frozen and stored at -80 °C until extraction. Spider tissue was taken only from the prosoma and legs when possible; full-body extractions were performed for smaller specimens. Tissues were disrupted with a mortar and pestle and homogenized with a rotor-stator homogenizer.
RNA was extracted using the TRIzol® reagent (Invitrogen Life Technologies) total RNA isolation protocol. Extracted RNA was further purified using the RNeasy Mini Kit (Qiagen) and samples were quantified using a NanoDrop 2000 (Thermo). After quantification, samples were sent to the Genomic Services Lab at the HudsonAlpha Institute for Biotechnology (Huntsville, AL) for cDNA library preparation and RNA sequencing with Poly(A) selection (RNA-seq). Samples, unless otherwise noted, were sequenced on the second-generation Illumina HiSeq 2000 platform (25 million 50 bp paired-end reads). Voucher specimens are available from the Auburn University Museum of Natural History collection. Sequence data are available from the NCBI short-read archive (SRA) (accessions in progress).

Sequence processing. Raw Illumina reads were filtered using a series of custom bash and Python scripts before being assembled de novo via the Trinity pipeline [S5]. The FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/index.html) was used to remove low-quality reads (Phred < 20); reads were trimmed and re-synced using a custom Python script (sync_paired_end_reads.py; https://github.com/martijnvermaat/bio-playground.git). The filtered and synced FASTQ files were assessed for quality with the program FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/), then fed to the Trinity pipeline and assembled without a reference genome. Finished transcriptome assemblies were subjected to HaMStR (Hidden Markov Model based Search for Orthologs using
Reciprocity) [S1] analysis with Daphnia pulex as the arthropod reference proteome. HaMStR uses a known set of "core orthologs" to train Hidden Markov Models (HMMs), then employs those trained models to identify matching patterns in a provided set of EST or transcriptome sequences. Putative orthologs identified in this manner were combined for all spider OTUs and outgroups (n=43) and processed with a custom ortholog filtering and alignment bash script following Brewer and Bond [S6]. Ortholog sets having representative sequences from fewer than 20 or fewer than 35 of the 42 OTUs (the two minimum taxon sampling settings explored) were eliminated, along with any ortholog sets < 100 AA residues long. After the initial filtering stages, ortholog sets were passed to the program MAFFT [S7] for alignment, SCaFoS [S8] for phylogenetic optimization of alignments (removal of potential paralogs and reduction of missing data), and then to a series of programs (Gblocks [S9], Aliscore [S10] & Alicut (http://zfmk.de/web/ZFMK_Mitarbeiter/KckPatrick/Software/AliCUT/Download/index.en.html)) for trimming of alignments. Another round of filtering based on the minimum OTU and minimum sequence length settings preceded the final concatenation and summary of the data matrices via the program FASconCAT [S11]. The program BaCoCa [S12] was employed to examine the data for biases in amino acid composition and to evaluate levels of missing data for each partition and OTU, two potential sources of systematic error in downstream phylogenomic inferences.

Phylogenetic analyses. Concatenated phylogenetic analyses were carried out using three optimality criteria: maximum likelihood (ML), Bayesian inference (BI), and parsimony (PA). For ML analyses, alternate substitution models were explored using the RAxML [S2] model selection Perl script for each individual gene. The analysis identified the DAYHOFF substitution model for all but several genes.
Subsequent analyses of the partitioned concatenated matrix recovered identical tree topologies but with far superior likelihood scores using the PROTGAMMAWAG model for each partition (findings consistent with the empirical studies of Whelan and Goldman [S13]); RAxML analyses reported herein employ that model for partitioned and concatenated analyses. We ran two separate sets of initial ML analyses using the bioinformatics pipeline (described above) with the minimum taxon sampling per locus set at 20 and 35 to evaluate the trade-off between adding loci and introducing additional missing cells in the matrix. Individual gene trees were first produced for each data set using ML and then inspected visually to remove alignments (loci) that contained obvious paralogs. The resulting 327 and 128 gene supermatrices were analyzed via ML using 500 random addition sequence replicates (RAS) and 1000 and 500 bootstrap (BS) replicates (respectively) for comparison. To accommodate potential gene tree heterogeneity, we estimated species trees from 128 individual gene trees and their 100 BS replicates. Individual gene trees were inferred using RAxML with 100 RAS and 100 full BS replicates and then evaluated using STAR [S14], MP-EST [S15], and NJst [S16]. All three species tree approaches can handle missing data and are more computationally feasible for large phylogenomic data sets than Bayesian methods that simultaneously estimate species and gene trees. First, STAR (Species Trees based on Average Ranks of coalescences) computes the topological distances among pairs of taxa by calculating the average of the ranks (number of nodes from the root) across pairs of taxa for all nodes in gene trees. This method assumes that any discrepancy between gene trees and the species tree is due to deep coalescence and that no recombination exists within each gene. For bootstrap datasets, STAR calculates a
consensus tree after evaluating all ranks of coalescences from all bootstrap replicates. It is thought to perform well when gene trees possess divergent evolutionary rates [S14], which is likely the case for these data. Second, MP-EST (Maximum Pseudo-likelihood for Estimating Species Trees) uses the frequencies of rooted triples, taken over all subsets of three taxa across the gene trees, to estimate the topology and branch lengths (in coalescent units) of the overall species tree. Derived from coalescent theory, MP-EST assumes no gene flow or horizontal gene transfer (HGT), although it has been shown to be robust to a small amount of HGT [S15]. MP-EST has been predicted to correctly estimate the species tree when internal branches are long. Because both STAR and MP-EST are based on summary statistics calculated across all gene trees, they are considered robust to violations of the assumptions that underlie many coalescent analyses (i.e., genes that diverge from the coalescent model exhibit little effect on the ability to accurately estimate the species tree) [S17]. For both analyses, gene trees were rooted using a set of Python, R, and bash scripts. Finally, gene trees were evaluated using NJst, a distance method for inferring an unrooted species tree from unrooted gene trees. The distance between two species is defined as the average number of internodes between them across the set of gene trees; these values form a pairwise distance table of average gene-tree internode distances. A neighbor-joining (NJ) tree is then calculated from the distance matrix in which the entries are twice the average ranks across gene trees. Liu and Yu [S16] show that if gene trees are inferred correctly, an assumption of the method, NJst is statistically consistent in estimating topologies of unrooted species trees and performs similarly to STAR. NJst assumes there is no gene flow or HGT between species.
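
The NJst distance described above, the average number of internodes separating two species across gene trees, can be sketched as follows. This is an illustrative reimplementation, not the PHYBASE code used in the study. It assumes gene trees are supplied as nested Python tuples with taxon names as strings, that a two-child top level represents an arbitrary rooting to be suppressed, and that a pair of taxa is averaged only over the gene trees containing both (one plausible way to accommodate missing data).

```python
from collections import deque
from itertools import count


def tree_to_graph(tree):
    """Convert a nested-tuple tree into an undirected adjacency dict."""
    adj = {}
    ids = count()

    def link(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    def build(node):
        if isinstance(node, str):          # leaf = taxon name
            return node
        name = ("internal", next(ids))     # unique label for an internal node
        for child in node:
            link(name, build(child))
        return name

    root = build(tree)
    # Unroot the tree: a degree-2 root is not a real internode, so remove it.
    if not isinstance(root, str) and len(adj[root]) == 2:
        a, b = adj.pop(root)
        adj[a].discard(root)
        adj[b].discard(root)
        link(a, b)
    return adj


def internode_counts(adj):
    """Number of internal nodes on the path between every pair of leaves."""
    leaves = sorted(n for n in adj if isinstance(n, str))
    result = {}
    for start in leaves:
        dist = {start: 0}
        queue = deque([start])
        while queue:                        # breadth-first search from this leaf
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        for other in leaves:
            if other > start:
                result[(start, other)] = dist[other] - 1  # edges minus one
    return result


def average_internode_distance(gene_trees):
    """Average the internode counts for each taxon pair across gene trees."""
    totals, seen = {}, {}
    for tree in gene_trees:
        for pair, d in internode_counts(tree_to_graph(tree)).items():
            totals[pair] = totals.get(pair, 0) + d
            seen[pair] = seen.get(pair, 0) + 1
    return {pair: totals[pair] / seen[pair] for pair in totals}


if __name__ == "__main__":
    trees = [(("A", "B"), ("C", "D")), (("A", "C"), ("B", "D"))]
    for pair, d in sorted(average_internode_distance(trees).items()):
        print(pair, d)
```

The resulting table would then be passed to a neighbor-joining routine; that step is omitted here.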
All three methods are considered partially parametric due to their use of summary statistics, based only on the topology of the gene trees, to reconstruct species phylogenies, whereas fully parametric methods use all aspects of the data to infer phylogenies. Because partially parametric methods use only part of the information contained in the data, they are thought to require more loci than fully parametric methods to achieve similar levels of confidence in the results [S14], [S17]. STAR and NJst were implemented within the PHYBASE package [S17] inside the R computing environment. MP-EST is written in C [S15] and is available at http://code.google.com/p/mp-est/. These analyses were also conducted on the STRAW species tree server at http://bioinformatics.publichealth.uga.edu/SpeciesTreeAnalysis/index.php. Analyses of supermatrices were conducted in RAxML 7.7.6 [S2] (ML), TNT [S3], and ExaBayes 1.2.1 (http://sco.h-its.org/exelixis/web/software/exabayes/index.html). TNT PA runs comprised 500 RAS and BS replicates using Sectorial, Drift, and Tree Fusing set to default parameters. Gaps were treated as missing; all sites were weighted equally. BI analyses on the 128 gene supermatrix comprised two independent runs, each consisting of 4 chains, that were run until the average deviation of split frequencies was < 5% for > 100,000 generations for the two runs. The substitution rate matrix (revMat) and branch length parameters (brlens) were linked across all partitions; all other parameters were unlinked to include the fixed substitution rate matrix (aaModel). A consensus tree and parameter summary files were produced using the "consense" and "postProcParam" post-processing tools in ExaBayes. Trees and summary support statistics for all analyses were visualized using FigTree (http://tree.bio.ed.ac.uk/software/FigTree/).
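
The convergence criterion above compares split frequencies between independent runs. As a rough illustration only, the sketch below computes an average standard deviation of split frequencies from per-run frequency tables. The input format, the function name, the `min_freq` cutoff, and the use of the population standard deviation are all assumptions made for this example; ExaBayes computes its own statistic internally and may differ in these details.

```python
from statistics import pstdev


def average_split_deviation(runs, min_freq=0.0):
    """Average, over splits, of the standard deviation of split frequencies.

    runs: list of dicts mapping a split (e.g. a frozenset of taxon names)
          to its frequency in that run's tree sample. A split missing from
          a run is treated as having frequency 0 there.
    min_freq: ignore splits whose frequency never exceeds this value
              (hypothetical cutoff, not taken from the source).
    """
    splits = set().union(*runs)
    deviations = []
    for split in splits:
        freqs = [run.get(split, 0.0) for run in runs]
        if max(freqs) > min_freq:
            deviations.append(pstdev(freqs))
    return sum(deviations) / len(deviations) if deviations else 0.0


if __name__ == "__main__":
    run0 = {frozenset({"A", "B"}): 1.0, frozenset({"C", "D"}): 0.6}
    run1 = {frozenset({"A", "B"}): 1.0, frozenset({"C", "D"}): 0.5}
    value = average_split_deviation([run0, run1])
    print(f"{value:.3f}", "converged" if value < 0.05 else "keep sampling")
```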


Topological tree comparisons to test alternative phylogenetic hypotheses were conducted using the programs RAxML and CONSEL [S18]. ML analyses comprised 100 RAS using the same model parameters as for the "best tree" searches. All alternative tree topology hypotheses and the best tree (Figure 2) were combined into a single tree file, and per-site log likelihood scores were calculated and output in Treepuzzle format using RAxML; model parameters were re-estimated for each tree using the -f G option. AU test scores were then calculated in CONSEL.

Morphological character ancestral state reconstructions using maximum likelihood (Mk1 model) were conducted using Mesquite 2.75 (http://mesquiteproject.org).

Divergence time estimation. Lineage divergence times were estimated using RelTime [S4], a new maximum likelihood method explicitly proposed for dating nodes from large phylogenomic datasets. Divergence times were estimated using the 55,447 site amino acid supermatrix with only the 40 ingroup spider taxa. A starting tree was provided (the ML best tree) as the reference topology. Local clocks were used for each lineage, with no clock rates merged. The WAG substitution model was employed with gamma distributed rates among sites (and 5 discrete gamma categories), with 78 parameters. All sites were used, with missing data identified as "X" and gaps as "-". Due to the nascence of RelTime, we explored a suite of different program parameters and fossil calibration min/max boundaries (17 separate analyses in total). The best scoring tree (LnL = -832175.684 and AICc = 1664507.375) and corresponding divergence estimates are reported.
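
The reported log likelihood and parameter count can be related to the information criterion through the standard definitions AIC = 2k - 2 lnL and AICc = AIC + 2k(k + 1)/(n - k - 1). The sketch below applies them to the values above. The source does not state the sample size n used for the small-sample correction, so `n` is left as a caller-supplied assumption and only the uncorrected AIC is computed from the reported numbers.

```python
def aic(ln_likelihood, k):
    """Akaike information criterion: 2k - 2 lnL."""
    return 2 * k - 2 * ln_likelihood


def aicc(ln_likelihood, k, n):
    """AIC with the small-sample correction 2k(k + 1) / (n - k - 1).

    n is the sample size; how it was counted for the reported AICc is not
    stated in the source, so any value passed here is an assumption.
    """
    return aic(ln_likelihood, k) + (2 * k * (k + 1)) / (n - k - 1)


if __name__ == "__main__":
    LN_L = -832175.684   # reported log likelihood of the best scoring tree
    K = 78               # reported number of parameters
    print(f"AIC = {aic(LN_L, K):.3f}")
```

The uncorrected AIC sits just below the reported AICc of 1664507.375, as expected, since the correction term is positive and shrinks as n grows.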

Eight nodes were calibrated from the oldest spider fossils known for targeted groups (Table S4). The Mesothelae (Liphistius) were given a minimum age of 299 mya, based on the mesothele fossil Palaeothele montceauensis (Selden, 1996), with a maximum set at 400 mya, a hypothesized earliest date for spiders (the earliest known spider-like arachnid with silk-producing structures, Attercopus fimbriunguis (Shear, Selden & Rolfe, 1987), dates from 386 mya; Selden et al., 2008). Extant araneomorphs were calibrated with the oldest known fossil, Triassaraneus andersonorum Selden et al., 1999, setting a minimum age of 228 mya and a maximum of 386 mya. The extant mygalomorphs were calibrated using the oldest known mygalomorph fossil, Rosamygale grauvogeli Selden & Gall, 1992, setting a minimum age of 242 mya and a maximum of 386 mya. The Haplogynae + Hypochilidae were calibrated using Eoplectreurys gertschi Selden and Huang, 2010 to set a minimum age of 173 mya and a maximum age of 299 mya. The Deinopoidea (cribellate orb-weavers) were calibrated using Mongolarachne jurassica Selden et al., 2013 [S19] to set a minimum age of 161 mya and a maximum of 299 mya. The node leading to the extant ecribellate orb-weavers was calibrated using Mesozygiella dunlopi Penney and Ortuño, 2006 to set a minimum age of 125 mya and a maximum of 299 mya. The node encompassing the mygalomorph families Nemesiidae, Barychelidae, and Theraphosidae was calibrated using the nemesiid Cretamygale chasei Selden, 2002 to set a minimum age of 125 mya and a maximum of 299 mya. Finally, the antrodiaetid mygalomorph lineages (Aliatypus and Antrodiaetus) were calibrated using Cretacattyma raveni Eskov and Zonstein, 1990 to set a minimum age of 113 mya and a maximum of 299 mya.
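
The eight calibrations can be cross-checked against the point estimates in Table S4. The sketch below encodes each calibrated node with its minimum and maximum bounds and its estimated age, all copied from Table S4, and lists any node whose estimate falls outside its bounds. The data structure and function name are illustrative only and are not part of the RelTime analysis.

```python
# node number: (label, minimum age, maximum age, estimated age), all in mya,
# transcribed from Table S4.
CALIBRATIONS = {
    1: ("Mesothelae", 299, 400, 387),
    3: ("Extant Mygalomorphs", 242, 386, 327),
    7: ("Nemesiids", 125, 299, 128),
    17: ("Antrodiaetids", 113, 299, 144),
    19: ("Extant Araneomorphs", 228, 386, 344),
    22: ("Cribellate Orb Web", 161, 299, 173),
    32: ("araneids + Nesticidae/Linyphiidae", 125, 299, 125),
    37: ("Haplogynae + Hypochilidae", 173, 299, 299),
}


def nodes_outside_bounds(calibrations):
    """Return node numbers whose estimated age violates its min/max bounds."""
    return [
        node
        for node, (_label, minimum, maximum, estimate) in calibrations.items()
        if not minimum <= estimate <= maximum
    ]


if __name__ == "__main__":
    violations = nodes_outside_bounds(CALIBRATIONS)
    print("All estimates within bounds" if not violations else violations)
```

Two of the estimates (nodes 32 and 37) sit exactly on a calibration boundary, which is worth keeping in mind when interpreting those dates.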


Supplemental References

S1. Ebersberger, I., Strauss, S., and von Haeseler, A. (2009). HaMStR: profile hidden markov model based search for orthologs in ESTs. BMC Evol Biol 9, 157.

S2. Stamatakis, A. (2014). RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics.

S3. Goloboff, P. A., Farris, J. S., and Nixon, K. C. (2008). TNT, a free program for phylogenetic analysis. Cladistics 24, 774–786.

S4. Tamura, K., Battistuzzi, F. U., Billing-Ross, P., Murillo, O., Filipski, A., and Kumar, S. (2012). Estimating divergence times in large molecular phylogenies. Proc Natl Acad Sci USA 109, 19333–19338.

S5. Grabherr, M. G., Haas, B. J., Yassour, M., Levin, J. Z., Thompson, D. A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q., et al. (2011). Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 29, 644–652.

S6. Brewer, M. S., and Bond, J. E. (2013). Ordinal-level phylogenomics of the arthropod class Diplopoda (millipedes) based on an analysis of 221 nuclear protein-coding loci generated using next-generation sequence analyses. PLoS ONE 8, e79935.

S7. Katoh, K., Misawa, K., Kuma, K. I., and Miyata, T. (2002). MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research 30, 3059–3066.

S8. Roure, B., Rodriguez-Ezpeleta, N., and Philippe, H. (2007). SCaFoS: a tool for selection, concatenation and fusion of sequences for phylogenomics. BMC Evol Biol 7 Suppl 1, S2.

S9. Castresana, J. (2000). Selection of Conserved Blocks from Multiple Alignments for Their Use in Phylogenetic Analysis. Mol Biol Evol 17, 540–552.

S10. Kück, P., Meusemann, K., Dambach, J., Thormann, B., von Reumont, B. M., Wägele, J. W., and Misof, B. (2010). Parametric and non-parametric masking of randomness in sequence alignments can be improved and leads to better resolved trees. Frontiers in Zoology 7, 10.

S11. Kück, P., and Meusemann, K. (2010). FASconCAT: Convenient handling of data matrices. Mol Phylogenet Evol 56, 1115–1118.

S12. Kück, P., and Struck, T. H. (2014). BaCoCa--a heuristic software tool for the parallel assessment of sequence biases in hundreds of gene and taxon partitions. Mol Phylogenet Evol 70, 94–98.


S13. Whelan, S., and Goldman, N. (2001). A General Empirical Model of Protein Evolution Derived from Multiple Protein Families Using a Maximum-Likelihood Approach. Mol Biol Evol 18, 691–699.

S14. Liu, L., Yu, L., Pearl, D. K., and Edwards, S. V. (2009). Estimating species phylogenies using coalescence times among sequences. Systematic Biol. 58, 468–477.

S15. Liu, L., Yu, L., and Edwards, S. V. (2010). A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evol Biol 10, 302.

S16. Liu, L., and Yu, L. (2011). Estimating species trees from unrooted gene trees. Systematic Biol. 60, 661–667.

S17. Song, S., Liu, L., and Edwards, S. V. (2012). Resolving conflict in eutherian mammal phylogeny using phylogenomics and the multispecies coalescent model. Proceedings of the ….

S18. Shimodaira, H., and Hasegawa, M. (2001). CONSEL: for assessing the confidence of phylogenetic tree selection. Bioinformatics 17, 1246–1247.

S19. Selden, P. A., Shih, C., and Ren, D. (2013). A giant spider from the Jurassic of China reveals greater diversity of the orbicularian stem group. Naturwissenschaften 100, 1171–1181.