Slides: tandy.cs.illinois.edu/pranjal-symposium.pdf

ASTRID: Accurate Species TRees from Internode Distances

Pranjal Vachaspati and Tandy Warnow

University of Illinois at Urbana-Champaign

1 / 23

What do we want in a coalescent-based species tree method?

1. Statistically consistent under the multi-species coalescent model
2. Accurate: gives trees as close to the true tree as possible
3. Scalable: can run on large datasets
4. Fast: gives an answer quickly on small datasets
5. Easy to use

2 / 23

Fast, scalable, accurate, coalescent-based species tree estimation

Recover species trees from whole genomes

[Figure: pipeline Sequences -> Alignments -> Gene Trees -> Species Tree. The Sequences -> Alignments step is labeled "MSA (PASTA, MUSCLE, CLUSTAL, etc.)" and the Alignments -> Gene Trees step is labeled "ML estimator (RAxML, FastTree, etc.)"]

3 / 23

Many different approaches for species tree estimation

[Figure: Gene Trees -> Species Tree]

4 / 23

ASTRID is designed with these goals in mind

[Figure: Gene Trees -> Internode Distance Matrices -> Average Distance Matrix -> Distance Method (FastME, BioNJ*) -> Estimated Species Tree]

Internode distance matrix for gene tree 1 (taxa in order A, B, C, D, E):

0 2 3 4 4
2 0 3 4 4
3 3 0 3 3
4 4 3 0 2
4 4 3 2 0

Internode distance matrix for gene tree 2:

0 3 4 4 2
3 0 3 3 3
4 3 0 2 4
4 3 2 0 4
2 3 4 4 0

Average distance matrix:

0   2.5 3.5 4   3
2.5 0   3   3.5 3.5
3.5 3   0   2.5 3.5
4   3.5 2.5 0   3
3   3.5 3.5 3   0

5 / 23
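
The averaging step on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not ASTRID's implementation: the edge-list encoding of the trees, the internal node names (u1, v1, ...), the two example topologies (read off the slide's matrices), and the function names are all made up for the example. Distance is counted as the number of edges on the path between two leaves, which is what the matrices on the slide show.

```python
from collections import deque


def internode_distances(edges):
    """Return {(leaf_a, leaf_b): number of edges between them} for a tree."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    leaves = sorted(n for n, nb in adj.items() if len(nb) == 1)
    dist = {}
    for src in leaves:
        # Breadth-first search from each leaf gives path lengths in edges.
        depth = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in depth:
                    depth[nxt] = depth[node] + 1
                    queue.append(nxt)
        for dst in leaves:
            if src < dst:
                dist[(src, dst)] = depth[dst]
    return dist


def average_distances(per_tree):
    """Average each pair's distance over the trees that contain that pair."""
    totals, counts = {}, {}
    for dist in per_tree:
        for pair, d in dist.items():
            totals[pair] = totals.get(pair, 0) + d
            counts[pair] = counts.get(pair, 0) + 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


# Hypothetical encodings of the two gene trees behind the slide's matrices.
tree1 = [("A", "u1"), ("B", "u1"), ("u1", "v1"), ("C", "v1"),
         ("v1", "w1"), ("D", "w1"), ("E", "w1")]
tree2 = [("A", "u2"), ("E", "u2"), ("u2", "v2"), ("B", "v2"),
         ("v2", "w2"), ("C", "w2"), ("D", "w2")]

avg = average_distances([internode_distances(tree1),
                         internode_distances(tree2)])
for pair in sorted(avg):
    print(pair, avg[pair])
```

Because each pair is averaged only over the trees in which both taxa occur, a pair that never co-occurs simply has no entry, which is the "missing entries" situation discussed on later slides.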

Based on NJst (Liu and Yu, 2011)

- NJst used neighbor-joining to estimate the species tree from the distance matrix
- ASTRID uses FastME and PhyD*; the latter works when the distance matrix is missing entries
- ASTRID is somewhat more accurate than NJst
- ASTRID is much faster than NJst

6 / 23
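
For readers unfamiliar with the neighbor-joining step that NJst applies to the distance matrix, here is a minimal topology-only sketch in Python. It is not FastME or PhyD* (which ASTRID uses and which are not reproduced here), it requires a complete matrix, and the function names and the `frozenset`-keyed distance encoding are assumptions made for the example. The input is the average distance matrix from slide 5.

```python
from itertools import combinations


def matrix_to_dist(taxa, rows):
    """Convert a square matrix into {frozenset({a, b}): distance}."""
    return {frozenset((taxa[i], taxa[j])): rows[i][j]
            for i, j in combinations(range(len(taxa)), 2)}


def neighbor_joining(taxa, dist):
    """Run neighbor joining and return the nontrivial splits of the tree.

    Each split is returned as frozenset({one_side, other_side}).
    """
    everything = frozenset(taxa)
    leaves_below = {t: frozenset([t]) for t in taxa}
    d = dict(dist)
    nodes = list(taxa)
    splits = set()
    while len(nodes) > 3:
        n = len(nodes)
        # r[i] is the total distance from node i to every other active node.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i)
             for i in nodes}

        def q(pair):
            # Standard neighbor-joining criterion; the minimum is joined.
            i, j = pair
            return (n - 2) * d[frozenset(pair)] - r[i] - r[j]

        i, j = min(combinations(nodes, 2), key=q)
        new = "(" + i + "," + j + ")"
        for k in nodes:
            if k != i and k != j:
                d[frozenset((new, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))]
                    - d[frozenset((i, j))]
                )
        leaves_below[new] = leaves_below[i] | leaves_below[j]
        splits.add(frozenset((leaves_below[new],
                              everything - leaves_below[new])))
        nodes = [k for k in nodes if k != i and k != j] + [new]
    return splits


taxa = ["A", "B", "C", "D", "E"]
average = [
    [0,   2.5, 3.5, 4,   3],
    [2.5, 0,   3,   3.5, 3.5],
    [3.5, 3,   0,   2.5, 3.5],
    [4,   3.5, 2.5, 0,   3],
    [3,   3.5, 3.5, 3,   0],
]
splits = neighbor_joining(taxa, matrix_to_dist(taxa, average))
for split in splits:
    print(" | ".join("".join(sorted(side)) for side in split))
```

On this matrix the sketch groups A with B and C with D. Two gene trees are far too few for a meaningful estimate; the example only shows the mechanics.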

ASTRID has mathematical properties guaranteeing statistical consistency under the multi-species coalescent

[Figure: Error (y-axis) versus Amount of data (x-axis)]

(Allman, Degnan, Rhodes 2016; http://arxiv.org/abs/1604.05364)

This doesn’t mean it will be accurate in practice, so we need experiments

7 / 23
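
One common way to write down statistical consistency is shown below. The notation is introduced here for illustration and does not come from the slides: $T$ is the true species tree topology and $\hat{T}_k$ is the topology estimated from $k$ true gene trees drawn under the multi-species coalescent.

```latex
\lim_{k \to \infty} \Pr\left[\hat{T}_k = T\right] = 1
```

The statement concerns the limit of unlimited, error-free gene trees, which is why it says nothing by itself about accuracy on finite, estimated data.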

Simulations help assess method accuracy

[Figure: Species Tree (random or biological) -> Gene Trees -> Alignments]

8 / 23

Simulations help assess method accuracy

[Figure: True Species Tree (random or biological) -> SimPhy -> True Gene Trees -> Indelible -> Alignments -> Estimated Gene Trees -> Estimated Species Tree, with a COMPARE step between the estimated and true trees]

9 / 23

Simulated datasets

Dataset               # taxa    # genes   AD%     # sites
Mammalian Sim [1,2]   37        200       21-50   250-1000
Avian Sim [3,4]       48        1000      29-60   250-1500
SimPhy [5]            50-1000   1000      9-69    300-1500

AD% measures the amount of ILS in a dataset: the average RF distance between the true gene trees and the true species tree.

[1] Generated by Mirarab et al., Bioinformatics 2014
[2] Based on biological data from Song et al., PNAS 2012
[3] Generated by Mirarab et al., Science 2014
[4] Based on biological dataset from Jarvis, Mirarab et al., Science 2014
[5] Mallo et al., 2015. "SimPhy: Comprehensive simulation of gene, locus and species trees at the genome-wide level."

10 / 23
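
The AD% definition above can be made concrete with a short Python sketch. It assumes unrooted binary trees on the same leaf set, encoded as hypothetical edge lists with made-up internal node names, and it normalizes the RF distance by 2(n - 3), the maximum for binary trees on n leaves. The function names are illustrative and not taken from any tool mentioned in the slides.

```python
def nontrivial_splits(edges):
    """Return (leaf set, set of nontrivial splits) for an unrooted tree."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    leaves = frozenset(n for n, nb in adj.items() if len(nb) == 1)
    ref = min(leaves)
    splits = set()
    for a, b in edges:
        # Collect the leaves on b's side of the edge (a, b).
        seen = {a, b}
        stack = [b]
        side = set()
        while stack:
            node = stack.pop()
            if node in leaves:
                side.add(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        side = frozenset(side)
        # Canonical form: always store the side without the reference leaf.
        if ref in side:
            side = leaves - side
        if 1 < len(side) < len(leaves) - 1:
            splits.add(side)
    return leaves, splits


def normalized_rf(edges1, edges2):
    """Normalized Robinson-Foulds distance between two binary trees."""
    leaves1, s1 = nontrivial_splits(edges1)
    leaves2, s2 = nontrivial_splits(edges2)
    if leaves1 != leaves2:
        raise ValueError("trees must have the same leaf set")
    return len(s1 ^ s2) / (2 * (len(leaves1) - 3))


def ad_percent(species_tree, gene_trees):
    """Average normalized RF distance to the species tree, as a percentage."""
    total = sum(normalized_rf(species_tree, g) for g in gene_trees)
    return 100 * total / len(gene_trees)


# Hypothetical five-taxon trees: ((A,B),C,(D,E)), ((A,E),B,(C,D)), ((A,B),D,(C,E)).
t1 = [("A", "u"), ("B", "u"), ("u", "v"), ("C", "v"),
      ("v", "w"), ("D", "w"), ("E", "w")]
t2 = [("A", "u"), ("E", "u"), ("u", "v"), ("B", "v"),
      ("v", "w"), ("C", "w"), ("D", "w")]
t3 = [("A", "u"), ("B", "u"), ("u", "v"), ("D", "v"),
      ("v", "w"), ("C", "w"), ("E", "w")]

print(ad_percent(t1, [t1, t2, t3]))
```

The same normalized RF distance, applied to an estimated species tree and the true species tree, is the "RF topological error" reported in the accuracy plots.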

Both ASTRAL and ASTRID are substantially more accurate than MP-EST

[Figure: RF topological error (0.00 to 0.25) versus sequence length (500 and 1000 bp) for ASTRAL-2, ASTRID-fastme, and MP-EST. Avian Dataset (48 taxa) with 500 genes, 47% AD]

11 / 23

ASTRID is sometimes more accurate than ASTRAL

[Figure: RF topological error (0.00 to 0.45) versus number of genes (10, 25, 50, 100, 200, 400, 800) for ASTRAL-2 and ASTRID-fastme. Avian Simulated (48 taxa); 47% ILS, 500bp sequences]

12 / 23

Sometimes ASTRID and ASTRAL have similar accuracy

[Figure: RF topological error (0.00 to 0.20) versus number of genes (10, 25, 50, 100, 200) for ASTRAL-2 and ASTRID-fastme. Mammalian Simulated (37 taxa); 29% ILS, 500bp sequences]

13 / 23

ASTRAL is sometimes more accurate than ASTRID

[Figure: RF topological error (0.0 to 0.7) versus number of genes (10, 25, 50, 100) for ASTRAL-2 and ASTRID-fastme. 500K generations, 1e-6 speciation rate; 200 taxa, 69% ILS, 300-1500bp sequences]

14 / 23

What if trees are missing taxa?

[Figure: four panels of RF topological error (0.0 to 1.0) versus number of genes (25, 50, 100, 250, 500, 1000) for ASTRAL-2 and ASTRID-bionj: a) 50 taxa, missing 10 per tree; b) 50 taxa, missing 20 per tree; c) 50 taxa, missing 30 per tree; d) 50 taxa, missing 40 per tree. 30% ILS, 300bp sequences]

With enough genes, even high levels of randomly missing taxa are OK

15 / 23

ASTRID is fast and scalable

48-taxon avian simulated dataset

[Figure: running time in seconds (0 to 350) versus number of genes (0 to 800) for NJst, ASTRAL-2, and ASTRID-fastme]

Single-processor runtimes

16 / 23

ASTRID is fast and scalable

- 1000 genes and 1000 taxa (simulated dataset): ASTRAL takes 12 hours, ASTRID-FastME takes 0.008 hours (30 seconds)!
- New 1kp dataset: 1178 taxa, 478 genes also takes 30 seconds
- Running time scales linearly with number of genes, quadratically with number of taxa
- Memory usage is constant with number of genes, quadratic with number of taxa

17 / 23

ASTRID is easy to use

- Available in source and binary form on GitHub: github.com/pranjalv123/ASTRID
- Just one command to run: ASTRID -i genetrees -o speciestree
- Can calculate multi-locus bootstrap support
- With ASTRAL, calculate quartet-based branch support and estimate branch lengths

18 / 23
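
When ASTRID is called from a pipeline, the one-line command above can be wrapped in a small Python helper. The sketch below only uses the `-i` and `-o` flags shown on the slide; the helper names and the assumption that the `ASTRID` binary is on the `PATH` are mine, and no other options are implied.

```python
import subprocess


def astrid_command(gene_trees, species_tree, binary="ASTRID"):
    """Build the argument list for: ASTRID -i genetrees -o speciestree."""
    return [binary, "-i", str(gene_trees), "-o", str(species_tree)]


def run_astrid(gene_trees, species_tree, binary="ASTRID"):
    """Run ASTRID and raise CalledProcessError if it exits with an error."""
    subprocess.run(astrid_command(gene_trees, species_tree, binary),
                   check=True)


# Printing the command is safe even when ASTRID is not installed.
print(" ".join(astrid_command("genetrees", "speciestree")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when file names contain spaces.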

Reliable trees on biological datasets

- Corrected mammalian dataset (Song et al., 2012) with erroneous genes removed.
- ASTRID tree on BestML gene trees with MLBS branch support.
- Same result as ASTRAL and concatenated maximum likelihood (RAxML).
- Single core: 100 bootstrap replicates in 6 seconds!
- ASTRAL takes 600 seconds for 100 reps

19 / 23

When should you use ASTRID?

More details at the tutorial: 11:15 AM Friday in MR7

- Gene tree heterogeneity due to ILS
- Many genes - ASTRID typically does best on 100+ genes.
- Many taxa - ASTRID scales to large datasets
- Most pairs of taxa appear in at least one tree - ASTRID performs poorly when the distance matrix is missing a lot of elements

20 / 23
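
The last criterion can be checked before running anything. The Python sketch below counts how many taxon pairs co-occur in at least one gene tree, given only the leaf set of each tree. The function name and the example leaf sets are hypothetical, and the slides do not state what fraction of missing pairs is too many.

```python
from itertools import combinations


def pair_coverage(leaf_sets):
    """Return (fraction of taxon pairs seen together, list of unseen pairs).

    leaf_sets holds one collection of taxon names per gene tree and must
    cover at least two taxa in total.
    """
    taxa = sorted(set().union(*leaf_sets))
    seen = set()
    for leaves in leaf_sets:
        seen.update(frozenset(p) for p in combinations(sorted(leaves), 2))
    all_pairs = [frozenset(p) for p in combinations(taxa, 2)]
    missing = [tuple(sorted(p)) for p in all_pairs if p not in seen]
    covered = (len(all_pairs) - len(missing)) / len(all_pairs)
    return covered, missing


# Hypothetical leaf sets for three gene trees over taxa A to E.
leaf_sets = [{"A", "B", "C"}, {"A", "B", "D"}, {"B", "C", "E"}]
covered, missing = pair_coverage(leaf_sets)
print(covered, missing)
```

Every pair in `missing` corresponds to an empty entry in the average distance matrix.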

Acknowledgements

- Funding: Roy J. Carver graduate fellowship; Ira and Debra Cohen fellowship, UIUC (PV); NSF GRFP (PV), NSF DBI-1461364, Grainger Foundation.

21 / 23

References

1. Vachaspati, Pranjal, and Tandy Warnow. "ASTRID: Accurate Species TRees from Internode Distances." BMC Genomics 16.Suppl 10 (2015): S3.
2. Lefort, Vincent, Richard Desper, and Olivier Gascuel. "FastME 2.0: a comprehensive, accurate, and fast distance-based phylogeny inference program." Molecular Biology and Evolution (2015): msv150.
3. Criscuolo, Alexis, and Olivier Gascuel. "Fast NJ-like algorithms to deal with incomplete distance matrices." BMC Bioinformatics 9.1 (2008): 166.
4. Liu, Liang, and Lili Yu. "Estimating species trees from unrooted gene trees." Systematic Biology 60.5 (2011): 661-667.
5. Allman, Elizabeth S., James H. Degnan, and John A. Rhodes. "Species tree inference from gene splits by Unrooted STAR methods." arXiv preprint arXiv:1604.05364 (2016).

22 / 23

Tutorial + Software download

- Available for Mac, Linux, and Windows at https://github.com/pranjalv123/ASTRID
- Tutorial tomorrow (Friday) at 11:15 AM, MR7

23 / 23