ASTRID: Accurate Species TRees from Internode Distances
Pranjal Vachaspati and Tandy Warnow
University of Illinois at Urbana-Champaign
1 / 23
What do we want in a coalescent-based species tree method?
1. Statistically consistent under the multi-species coalescent model
2. Accurate: gives trees as close to the true tree as possible
3. Scalable: can run on large datasets
4. Fast: gives an answer quickly on small datasets
5. Easy to use
2 / 23
Fast, scalable, accurate, coalescent-based species tree estimation
Recover species trees from whole genomes
[Figure: pipeline. Sequences -> (MSA: PASTA, MUSCLE, CLUSTAL, etc.) -> Alignments -> (ML estimator: RAxML, FastTree, etc.) -> Gene Trees -> Species Tree]
3 / 23
Many different approaches for species tree estimation
[Figure: Gene Trees -> Species Tree]
4 / 23
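ASTRID is designed with these goals in mind
[Figure: the ASTRID pipeline. Gene Trees -> Internode Distance Matrices -> Average Distance Matrix -> Distance Method (FastME, BioNJ*) -> Estimated Species Tree]
Internode distance matrix for gene tree ((A,B),C,(D,E)), rows and columns ordered A, B, C, D, E:
0 2 3 4 4
2 0 3 4 4
3 3 0 3 3
4 4 3 0 2
4 4 3 2 0
Internode distance matrix for gene tree ((A,E),B,(C,D)):
0 3 4 4 2
3 0 3 3 3
4 3 0 2 4
4 3 2 0 4
2 3 4 4 0
Average distance matrix:
0   2.5 3.5 4   3
2.5 0   3   3.5 3.5
3.5 3   0   2.5 3.5
4   3.5 2.5 0   3
3   3.5 3.5 3   0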
5 / 23
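The matrix-averaging step can be sketched in a few lines of Python. This is a minimal illustration, not ASTRID's implementation: trees are written as nested tuples (a representation assumed here for convenience), the internode distance is taken to be the number of edges between two leaves, and the top-level tuple is assumed to be a node of degree three or more, as in an unrooted tree. The function names are hypothetical.

```python
from itertools import combinations


def leaf_paths(tree, prefix=()):
    """Yield (leaf_name, path) pairs; a path is the tuple of child indices from the top node."""
    if isinstance(tree, str):
        yield tree, prefix
    else:
        for i, child in enumerate(tree):
            yield from leaf_paths(child, prefix + (i,))


def internode_distances(tree):
    """Return {(a, b): number of edges between leaves a and b} for one tree."""
    paths = dict(leaf_paths(tree))
    dist = {}
    for a, b in combinations(sorted(paths), 2):
        pa, pb = paths[a], paths[b]
        shared = 0
        for x, y in zip(pa, pb):
            if x != y:
                break
            shared += 1
        # Edges from each leaf up to the point where the two paths diverge.
        dist[(a, b)] = len(pa) + len(pb) - 2 * shared
    return dist


def average_distances(trees):
    """Average each pair's distance over the trees that contain both taxa."""
    totals, counts = {}, {}
    for tree in trees:
        for pair, d in internode_distances(tree).items():
            totals[pair] = totals.get(pair, 0) + d
            counts[pair] = counts.get(pair, 0) + 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


# Two illustrative five-taxon gene trees.
gene_trees = [
    (("A", "B"), "C", ("D", "E")),
    (("A", "E"), "B", ("C", "D")),
]
avg = average_distances(gene_trees)
print(avg[("A", "B")], avg[("A", "D")], avg[("D", "E")])
```

The resulting dictionary plays the role of the average distance matrix, which would then be handed to a distance method such as FastME or BioNJ*. Pairs that never co-occur in any tree simply have no entry.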
Based on NJst (Liu and Yu, 2011)
- NJst used neighbor-joining to estimate the species tree from the distance matrix
- ASTRID uses FastME and PhyD*; PhyD* works even when the distance matrix is missing entries
- ASTRID is somewhat more accurate than NJst
- ASTRID is much faster than NJst
6 / 23
ASTRID has mathematical properties guaranteeing statistical consistency under the multi-species coalescent
[Figure: Error (y-axis) vs. Amount of data (x-axis)]
(Allman, Degnan, Rhodes 2016; http://arxiv.org/abs/1604.05364)
This doesn't mean it will be accurate in practice, so we need experiments
7 / 23
Simulations help assess method accuracy
[Figure: Species Tree (random or biological) -> Gene Trees -> Alignments]
8 / 23
Simulations help assess method accuracy
[Figure: True Species Tree (random or biological) -> (SimPhy) -> True Gene Trees -> (Indelible) -> Alignments -> Estimated Gene Trees -> Estimated Species Tree; COMPARE the estimated trees with the true trees]
9 / 23
Simulated datasets
Dataset              # taxa    # genes  AD%    # sites
Mammalian Sim [1,2]  37        200      21-50  250-1000
Avian Sim [3,4]      48        1000     29-60  250-1500
SimPhy [5]           50-1000   1000     9-69   300-1500
AD% measures the amount of incomplete lineage sorting (ILS) in a dataset: the average RF distance between the true gene trees and the true species tree.
[1] Generated by Mirarab et al., Bioinformatics 2014
[2] Based on biological data from Song et al., PNAS 2012
[3] Generated by Mirarab et al., Science 2014
[4] Based on biological dataset from Jarvis, Mirarab et al., Science 2014
[5] Mallo et al., 2015. "SimPhy: Comprehensive simulation of gene, locus and species trees at the genome-wide level."
10 / 23
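The AD% definition, and the RF topological error used in the accuracy plots, can be sketched as follows. This is an illustrative Python version under stated assumptions: trees are nested tuples over the same leaf set (missing taxa are not handled), and the RF distance is normalized by the total number of non-trivial bipartitions in the two trees, which is one common convention and may differ from the normalization used in the experiments. All function names are hypothetical.

```python
def leaf_set(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Return the non-trivial bipartitions of a tree, each stored as the side
    that does not contain a fixed anchor taxon."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = below if anchor not in below else all_leaves - below
        # Keep only splits with at least two taxa on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return below

    visit(tree)
    return splits


def rf_error(tree_a, tree_b):
    """Normalized Robinson-Foulds distance in [0, 1]."""
    sa, sb = bipartitions(tree_a), bipartitions(tree_b)
    total = len(sa) + len(sb)
    return len(sa ^ sb) / total if total else 0.0


def average_distance_percent(species_tree, gene_trees):
    """AD%: mean normalized RF distance from the gene trees to the species tree."""
    return 100 * sum(rf_error(species_tree, g) for g in gene_trees) / len(gene_trees)


species = (("A", "B"), "C", ("D", "E"))
genes = [
    (("A", "B"), "C", ("D", "E")),   # agrees with the species tree
    (("A", "E"), "B", ("C", "D")),   # shares no bipartition with it
]
print(average_distance_percent(species, genes))
```

The same `rf_error` function, applied to an estimated species tree and the true species tree, gives the quantity plotted on the y-axis of the accuracy figures.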
Both ASTRAL and ASTRID are substantially more accurate than MP-EST
[Figure: RF topological error (y-axis, 0.00-0.25) vs. sequence length in bp (x-axis: 500, 1000). Avian dataset (48 taxa) with 500 genes, 47% AD. Methods: ASTRAL2, ASTRID-fastme, MPEST]
11 / 23
ASTRID is sometimes more accurate than ASTRAL
[Figure: RF topological error (y-axis, 0.00-0.45) vs. number of genes (x-axis: 10, 25, 50, 100, 200, 400, 800). Avian Simulated (48 taxa), 47% ILS, 500bp sequences. Methods: ASTRAL-2, ASTRID-fastme]
12 / 23
Sometimes ASTRID and ASTRAL have similar accuracy
[Figure: RF topological error (y-axis, 0.00-0.20) vs. number of genes (x-axis: 10, 25, 50, 100, 200). Mammalian Simulated (37 taxa), 29% ILS, 500bp sequences. Methods: ASTRAL-2, ASTRID-fastme]
13 / 23
ASTRAL is sometimes more accurate than ASTRID
[Figure: RF topological error (y-axis, 0.0-0.7) vs. number of genes (x-axis: 10, 25, 50, 100). 500K generations, 1e-6 speciation rate; 200 taxa, 69% ILS, 300-1500bp sequences. Methods: ASTRAL-2, ASTRID-fastme]
14 / 23
What if trees are missing taxa?
[Figure: four panels of RF topological error (y-axis, 0.0-1.0) vs. number of genes (x-axis: 25, 50, 100, 250, 500, 1000), comparing ASTRAL-2 and ASTRID-bionj: a) 50 taxa, missing 10 per tree; b) 50 taxa, missing 20 per tree; c) 50 taxa, missing 30 per tree; d) 50 taxa, missing 40 per tree. 30% ILS, 300bp sequences]
With enough genes, even high levels of randomly missing taxa are OK
15 / 23
ASTRID is fast and scalable
48-taxon avian simulated dataset
[Figure: running time in seconds (y-axis, 0-350) vs. number of genes (x-axis, 0-800). Methods: NJst, ASTRAL-2, ASTRID-fastme]
Single-processor runtimes
16 / 23
ASTRID is fast and scalable
- 1000 genes and 1000 taxa (simulated dataset): ASTRAL takes 12 hours, ASTRID-FastME takes 0.008 hours (30 seconds)!
- New 1kp dataset: 1178 taxa, 478 genes also takes 30 seconds
- Running time scales linearly with number of genes, quadratically with number of taxa
- Memory usage is constant with number of genes, quadratic with number of taxa
17 / 23
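One way to obtain the scaling described above is to stream over the gene trees and keep only a running sum and a running count for each pair of taxa. The Python sketch below illustrates that idea only; it is not a description of ASTRID's internals, the class and method names are hypothetical, and the distance values in the usage lines are made up for illustration.

```python
class RunningAverageMatrix:
    """Accumulate per-pair distance sums and counts across gene trees.

    Storage is two n-by-n tables for n taxa, regardless of how many
    gene trees are added.
    """

    def __init__(self, taxa):
        self.taxa = list(taxa)
        self.index = {name: i for i, name in enumerate(self.taxa)}
        n = len(self.taxa)
        self.total = [[0.0] * n for _ in range(n)]
        self.count = [[0] * n for _ in range(n)]

    def add_tree(self, pair_distances):
        """Fold in one gene tree, given as {(taxon_a, taxon_b): distance}."""
        for (a, b), d in pair_distances.items():
            i, j = self.index[a], self.index[b]
            self.total[i][j] += d
            self.total[j][i] += d
            self.count[i][j] += 1
            self.count[j][i] += 1

    def average(self):
        """Return the averaged matrix; pairs never seen together are None."""
        n = len(self.taxa)
        return [
            [
                0.0 if i == j
                else (self.total[i][j] / self.count[i][j] if self.count[i][j] else None)
                for j in range(n)
            ]
            for i in range(n)
        ]


acc = RunningAverageMatrix(["A", "B", "C", "D"])
acc.add_tree({("A", "B"): 2, ("A", "C"): 3, ("B", "C"): 3})  # tree without D
acc.add_tree({("A", "B"): 4, ("A", "D"): 3, ("B", "D"): 3})  # tree without C
matrix = acc.average()
print(matrix[0][1], matrix[2][3])
```

Each added tree costs time proportional to the number of taxon pairs it contains, so total time grows linearly with the number of genes and quadratically with the number of taxa. The `None` entries are the missing elements that a method such as PhyD* has to tolerate.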
ASTRID is easy to use
- Available in source and binary form on GitHub: github.com/pranjalv123/ASTRID
- Just one command to run: ASTRID -i genetrees -o speciestree
- Can calculate multi-locus bootstrap support
- With ASTRAL, calculate quartet-based branch support and estimate branch lengths
18 / 23
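For scripted pipelines, the one-line command can be wrapped in Python. Only the `-i` and `-o` flags come from the slide; the assumption that the executable is named `ASTRID` and is on the PATH, the file names, and the helper names are hypothetical, and the input file format is not specified here.

```python
import shutil
import subprocess


def astrid_command(gene_trees_path, species_tree_path, executable="ASTRID"):
    """Build the argument list for: ASTRID -i <gene trees> -o <species tree>."""
    return [executable, "-i", str(gene_trees_path), "-o", str(species_tree_path)]


def run_astrid(gene_trees_path, species_tree_path, executable="ASTRID"):
    """Run ASTRID, raising if the executable is missing or exits with an error."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable} was not found on PATH")
    subprocess.run(
        astrid_command(gene_trees_path, species_tree_path, executable),
        check=True,
    )


# Building the command does not require ASTRID to be installed.
print(astrid_command("genetrees", "speciestree"))
```

Calling `run_astrid("genetrees", "speciestree")` would then execute the same command shown on the slide.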
Reliable trees on biological datasets
- Corrected mammalian dataset (Song et al., 2012) with erroneous genes removed.
- ASTRID tree on BestML gene trees with MLBS branch support.
- Same result as ASTRAL, concatenated maximum-likelihood (RAxML).
- Single core: 100 bootstrap replicates in 6 seconds!
- ASTRAL takes 600 seconds for 100 reps
19 / 23
When should you use ASTRID?
More details at the tutorial: 11:15 AM Friday in MR7
- Gene tree heterogeneity due to ILS
- Many genes: ASTRID typically does best on 100+ genes.
- Many taxa: ASTRID scales to large datasets
- Most pairs of taxa appear in at least one tree: ASTRID performs poorly when the distance matrix is missing a lot of elements
20 / 23
Acknowledgements
- Funding: Roy J. Carver graduate fellowship; Ira and Debra Cohen fellowship, UIUC (PV); NSF GRFP (PV), NSF DBI-1461364, Grainger Foundation.
21 / 23
References
1. Vachaspati, Pranjal, and Tandy Warnow. "ASTRID: Accurate Species TRees from Internode Distances." BMC genomics 16.Suppl 10 (2015): S3.
2. Lefort, Vincent, Richard Desper, and Olivier Gascuel. "FastME 2.0: a comprehensive, accurate, and fast distance-based phylogeny inference program." Molecular biology and evolution (2015): msv150.
3. Criscuolo, Alexis, and Olivier Gascuel. "Fast NJ-like algorithms to deal with incomplete distance matrices." BMC bioinformatics 9.1 (2008): 166.
4. Liu, Liang, and Lili Yu. "Estimating species trees from unrooted gene trees." Systematic biology 60.5 (2011): 661-667.
5. Allman, Elizabeth S., James H. Degnan, and John A. Rhodes. "Species tree inference from gene splits by Unrooted STAR methods." arXiv preprint arXiv:1604.05364 (2016).
22 / 23
Tutorial + Software download
- Available for Mac, Linux, and Windows at https://github.com/pranjalv123/ASTRID
- Tutorial tomorrow (Friday) at 11:15 AM, MR7
23 / 23