astral-iii: increased scalability and impacts of...

44
ASTRAL-III: increased scalability and impacts of contracting low support branches Chao Zhang Maryam Rabiee Erfan Sayyari Siavash Mirarab University of California, San Diego

Upload: others

Post on 15-May-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

ASTRAL-III: increased scalability and impacts of contracting low support branches

Chao Zhang Maryam Rabiee

Erfan Sayyari Siavash Mirarab

University of California, San Diego

Gene tree discordance

2

OrangutanGorilla ChimpHuman

The species tree

A gene treeOrang.Gorilla ChimpHuman Orang.Gorilla Chimp Human

Causes of gene tree discordance include:• Incomplete Lineage Sorting (ILS) • Duplication and loss • Horizontal Gene Transfer (HGT)

gene1000gene 1

Incomplete Lineage Sorting (ILS)A random process related to the coalescence of alleles across various populations

3

Tracing alleles through generations

Incomplete Lineage Sorting (ILS)A random process related to the coalescence of alleles across various populations

3

Tracing alleles through generations

Incomplete Lineage Sorting (ILS)A random process related to the coalescence of alleles across various populations

3

Tracing alleles through generations

Multi-species coalescent (MSC) model captures ILS

Unrooted quartets under MSC model

For a quartet (4 species), the unrooted species tree topology has at least 1/3 probability in gene trees (Allman, et al. 2010)

4

Orang.

Gorilla Chimp

HumanOrang.

GorillaChimp

Human

Orang.

Gorilla

Chimp

Human

θ2=15% θ3=15%θ1=70%Gorilla

Orang.

Chimp

Human

d=0.8

Unrooted quartets under MSC model

For a quartet (4 species), the unrooted species tree topology has at least 1/3 probability in gene trees (Allman, et al. 2010)

4

Orang.

Gorilla Chimp

HumanOrang.

GorillaChimp

Human

Orang.

Gorilla

Chimp

Human

θ2=15% θ3=15%θ1=70%Gorilla

Orang.

Chimp

Human

d=0.8

The most frequent gene tree = The most likely species tree

5

Orang.Gorilla ChimpHuman Rhesus

More than 4 speciesFor >4 species, the species tree topology can be different from the most like gene tree (called anomaly zone) (Degnan, 2013)

5

Orang.Gorilla ChimpHuman Rhesus

1. Break gene trees into (n 4 ) quartets of species

2. Find the dominant tree for all quartets of taxa 3. Combine quartet trees

Some tools (e.g.. BUCKy-p [Larget, et al., 2010])

More than 4 speciesFor >4 species, the species tree topology can be different from the most like gene tree (called anomaly zone) (Degnan, 2013)

5

Orang.Gorilla ChimpHuman Rhesus

1. Break gene trees into (n 4 ) quartets of species

2. Find the dominant tree for all quartets of taxa 3. Combine quartet trees

Some tools (e.g.. BUCKy-p [Larget, et al., 2010])

More than 4 speciesFor >4 species, the species tree topology can be different from the most like gene tree (called anomaly zone) (Degnan, 2013)

Alternative: weight all 3(n

4 ) quartet topologies by their frequency

and find the optimal tree

Maximum Quartet Support Species Tree

• Optimization problem:

6

Find the species tree with the maximum number of induced quartet trees shared with the collection of k input gene trees

the set of quartet trees induced by T

a gene treeScore(T ) =

mX

1

|Q(T ) \Q(ti)|

Maximum Quartet Support Species Tree

• Optimization problem:

• Theorem: Statistically consistent under the multi-species coalescent model when solved exactly

6

Find the species tree with the maximum number of induced quartet trees shared with the collection of k input gene trees

the set of quartet trees induced by T

a gene treeScore(T ) =

mX

1

|Q(T ) \Q(ti)|

Maximum Quartet Support Species Tree

• Optimization problem:

• Theorem: Statistically consistent under the multi-species coalescent model when solved exactly

6

Find the species tree with the maximum number of induced quartet trees shared with the collection of k input gene trees

the set of quartet trees induced by T

a gene tree

NP-Hard [Lafond & Scornavaccaori, 2016]

Score(T ) =mX

1

|Q(T ) \Q(ti)|

ASTRAL-I [Mirarab, et al., Bioinformatics, 2014]

• ASTRAL solves the problem using dynamic programming

7

ASTRAL-I [Mirarab, et al., Bioinformatics, 2014]

• ASTRAL solves the problem using dynamic programming

• Introduced a constrained version of the problem

• Draws branches in the species tree from a given set X = {all bipartitions in all gene trees}

• Th constrained version is statistically consistent

• Running time: O(n2k |X |1.73)

7

ASTRAL-II [Mirarab and Warnow, Bioinformatics, 2015]

1. Faster calculation of the score function inside DP

• O(nk |X |1.73) instead of O(n 2k |X |1.73)

8

ASTRAL-II [Mirarab and Warnow, Bioinformatics, 2015]

1. Faster calculation of the score function inside DP

• O(nk |X |1.73) instead of O(n 2k |X |1.73)

2. Add extra bipartitions to the set X using heuristics

• Consensus + support + subsampling species

• Using quartet-based distances to find likely branches

• Complete incomplete gene trees

8

ASTRAL-II [Mirarab and Warnow, Bioinformatics, 2015]

1. Faster calculation of the score function inside DP

• O(nk |X |1.73) instead of O(n 2k |X |1.73)

2. Add extra bipartitions to the set X using heuristics

• Consensus + support + subsampling species

• Using quartet-based distances to find likely branches

• Complete incomplete gene trees

3. Can handle input gene trees with polytomies in O(n3k |X |1.73)

8

ASTRAL used by the biologists• Plants: Wickett, et al., 2014, PNAS • Birds: Prum, et al., 2015, Nature • Xenoturbella, Cannon et al., 2016, Nature • Xenoturbella, Rouse et al., 2016, Nature • Flatworms: Laumer, et al., 2015, eLife • Shrews: Giarla, et al., 2015, Syst. Bio. • Frogs: Yuan et al., 2016, Syst. Bio. • Tomatoes: Pease, et al., 2016, PLoS Bio. • Angiosperms: Huang et al., 2016, MBE • Worms: Andrade, et al., 2015, MBE

AST

RA

L-II

AST

RA

L

ASTRAL-III: improved running time

Feature 1:

ASTRAL-III can now analyze multifurcating gene trees with the same running time as binary gene trees: O(nk |X |1.73)

10

Zhang et al. Page 14 of 34

(a)

50 200 500 1000

non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75

2−12

2−10

2−8

2−6

contractionAv

g we

ight

cal

cula

tion

time

(sec

onds

) 1600

200

ASTRAL−II

ASTRAL−III

(b)

● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ●

●● ● ● ●

●●

●●

● ● ● ● ● ●●

● ●

● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●●

●●

●● ● ● ●

●●

●●

● ● ● ● ●

●●

● ● ● ● ● ●●

●●

●● ● ● ● ●

●●

●●

●●

● ●●

● ● ● ● ● ●

● ● ● ● ●

● ●

50 200 500 1000

non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 750

25

50

75

100

contraction

Set X

clu

ster

size

(K)

Figure 6 Weight calculation and |X| on S100. Average and standard error of (a) the time ittakes to score a single tripartition using Eq. 3 and (b) search space size |X| are shown for bothASTRAL-II and ASTRAL-III on the S100 dataset. Running time is in log scale. We vary numbersof gene trees (boxes) and sequence length (200 and 1600). See Fig. S3 for similar patterns forwith 400 and 800bp alignments.

3.2.2 Running time for large polytomies

ASTRAL-III has a clear advantage compared to ASTRAL-II with respect to the

running time when gene trees include polytomies (Fig. 6a). Since ASTRAL-II and

ASTRAL-III can have a di↵erent set X, we show the running time per each weight

calculation (i.e., Eq. 3). As we contract low support branches and hence increase the

prevalence of polytomies, the weight calculation time quickly grows for ASTRAL-II,

whereas, in ASTRAL-III, the weight calculation time remains flat, or even decreases.

3.2.3 The search space

Comparing the size of the search space (|X|) between ATRAL-II and ASTRAL-

III shows that as intended, the search space is decreased in size for cases with no

polytomy but can increase in the presence of polytomies (Fig. 6b). With no contrac-

tion, on average, |X| is always smaller for ASTRAL-III than ASTRAL-II. With few

error-prone gene trees (50 gene trees from 200bp alignments), the search space has

reduced dramatically but with many genes or high-quality gene trees, the reduc-

tions are minimal. Moreover, the search space for gene trees estimated from short

alignments (e.g., 200bp) is several times larger than those based on longer align-

ments (e.g., 1600bp) for both methods. Contracting low support branches initially

increases the search space because ASTRAL-III adds multiple resolutions per poly-

tomy to X. Further contraction results in reductions in |X|, presumably because

many polytomies exist and they are resolved similarly inside ASTRAL-III.

Zhang et al. Page 14 of 34

(a)

50 200 500 1000

non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75

2−12

2−10

2−8

2−6

contraction

Avg

weig

ht c

alcu

latio

n tim

e (s

econ

ds) 1600

200

ASTRAL−II

ASTRAL−III

(b)

● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ●

●● ● ● ●

●●

●●

● ● ● ● ● ●●

● ●

● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●●

●●

●● ● ● ●

●●

●●

● ● ● ● ●

●●

● ● ● ● ● ●●

●●

●● ● ● ● ●

●●

●●

●●

● ●●

● ● ● ● ● ●

● ● ● ● ●

● ●

50 200 500 1000

non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 750

25

50

75

100

contraction

Set X

clu

ster

size

(K)

Figure 6 Weight calculation and |X| on S100. Average and standard error of (a) the time ittakes to score a single tripartition using Eq. 3 and (b) search space size |X| are shown for bothASTRAL-II and ASTRAL-III on the S100 dataset. Running time is in log scale. We vary numbersof gene trees (boxes) and sequence length (200 and 1600). See Fig. S3 for similar patterns forwith 400 and 800bp alignments.

3.2.2 Running time for large polytomies

ASTRAL-III has a clear advantage compared to ASTRAL-II with respect to the

running time when gene trees include polytomies (Fig. 6a). Since ASTRAL-II and

ASTRAL-III can have a di↵erent set X, we show the running time per each weight

calculation (i.e., Eq. 3). As we contract low support branches and hence increase the

prevalence of polytomies, the weight calculation time quickly grows for ASTRAL-II,

whereas, in ASTRAL-III, the weight calculation time remains flat, or even decreases.

3.2.3 The search space

Comparing the size of the search space (|X|) between ATRAL-II and ASTRAL-

III shows that as intended, the search space is decreased in size for cases with no

polytomy but can increase in the presence of polytomies (Fig. 6b). With no contrac-

tion, on average, |X| is always smaller for ASTRAL-III than ASTRAL-II. With few

error-prone gene trees (50 gene trees from 200bp alignments), the search space has

reduced dramatically but with many genes or high-quality gene trees, the reduc-

tions are minimal. Moreover, the search space for gene trees estimated from short

alignments (e.g., 200bp) is several times larger than those based on longer align-

ments (e.g., 1600bp) for both methods. Contracting low support branches initially

increases the search space because ASTRAL-III adds multiple resolutions per poly-

tomy to X. Further contraction results in reductions in |X|, presumably because

many polytomies exist and they are resolved similarly inside ASTRAL-III.

Low gene tree errorHigh gene tree error

more polytomies

ASTRAL-III: improved running time

Feature 2: Can exploit similarities between gene trees

• O(D |X |1.73) where D is the sum of degrees of all unique gene tree nodes

• Overlay all gene trees onto a single polytree; use a bottom-up traversal to score a cluster

11

Zhang et al. Page 30 of 34

●● ●●

●● ● ●

●● ●●

●● ● ●●

500 1500

0 5000 10000 15000 0 5000 10000 15000

0

500

1000

1500

#Genes

Run

ning

tim

e (m

inut

es)

ASTRAL2

ASTRAL3

Figure S2 Running time versus k. Average running time of ASTRAL-II versus ASTRAL-III onthe avian dataset with 500bp or 1500bp alignments with varying numbers of gens (k), shown innormal scale. A LOESS curve is fit to the data points. ASTRAL-II could not finish on 2

14 genes inthe allotted 48 hour time. Averages are over 4 runs.

(a)

50 200 500 1000

non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75

2−12

2−10

2−8

2−6

contraction

Avg

weig

ht c

alcu

latio

n tim

e (s

econ

ds) ASTRAL−II

ASTRAL−III

400

800

(b)

●●

● ● ● ● ● ● ●●

● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ●

●● ● ● ● ●

●● ● ●●

● ● ● ● ●

● ●

● ● ● ● ● ●●

● ● ●●

● ● ● ● ●●

●●

●●

●● ●

● ●

● ● ●●

● ●● ● ● ●

●●

● ● ● ●●

●●

●●

● ●●

● ●●

●●

● ● ●●

50 200 500 1000

non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 750

20000

40000

60000

contraction

Set X

clu

ster

size

400 800 ● ●ASTRAL−II ASTRAL−III

Figure S3 Weight calculation and |X| on S100. Average and standard error of (a) the time ittakes to score a single tripartition using Eq. 3 and (b) search space size |X| for both ASTRAL-II(red) and ASTRAL-III (blue) on the S100 dataset. Running time is in log scale for varyingnumbers of gene trees (boxes) and sequence length 400 and 800 (line types).

Zhang et al. Page 13 of 34

2.272.09

2.292.06

500 1500

28 29 210 211 212 213 214 28 29 210 211 212 213 214

20

25

210

#Genes

Run

ning

tim

e (m

inut

es)

ASTRAL2

ASTRAL3

Figure 5 Running time versus k. Average running times (4 replicates) are shown for ASTRAL-IIand ASTRAL-III on the avian dataset with 500bp or 1500bp alignments with varying numbers ofgens (k), shown in log scale (see Fig. S2 for normal scale). A line is fit to the data points in thelog/log space and line slopes are shown. ASTRAL-II did not finish on 2

14 genes in 48 hour.

branches has minimal impact on the discordance (eight discordant branches with

binned MP-EST instead of nine). However, contracting low support branches with

3%–33% thresholds dramatically reduces the discordance with the reference tree (2,

2, 4, 2, 3, and 3 discordant branches, respectively, for 3%, 5%, 7%, 10%, 20%, and

33%). Three thresholds (3%, 5%, and 10%) produce an identical tree (Fig. 4d). The

remaining di↵erences are among the branches that are deemed unresolved by Jarvis

et al. and change among the reference trees as well [5]. Contracting at 50% and 75%

thresholds, however, increases discordance to five and six branches, respectively.

Thus, consistent with simulations, contracting very low support branches seems

to produce the best results, when judged by similarity with the reference trees. To

summarize, ASTRAL-III obtained on unbinned but collapsed gene trees agreed with

all major relations in Jarvis et al., including the novel Columbea group, whereas

the unresolved tree missed important clades (Fig. 4).

3.2 RQ2: Running time improvements

We study the improvements in running time as various parameters change.

3.2.1 Varying the number of genes (k)

We compare ASTRAL-III to ASTRAL-II on the avian simulated dataset, changing

the number of genes from 28 to 214 and forcing X to be the same for both versions to

enable comparing impacts of improved weight calculation. We allow each replicate

run to take up to two days. ASTRAL-II is not able to finish on the dataset with

k = 214, while ASTRAL-III finishes on all conditions. ASTRAL-III improves the

running time over ASTRAL-II and the extent of the improvement depends on k

(Fig. 5). With 1000 genes or more, there is at least a 2.1X improvement. With

213 genes, the largest value where both versions could run, ASTRAL-III finishes

on average 3.2 times faster than ASTRAL-II (234 versus 758 minutes). Moreover,

fitting a line to the average running time in the log-log scale graph reveals that

on this dataset, the running time of ASTRAL-III on average grows as O(k2.08),

which is better than that of ASTRAL-II at O(k2.28), and both are better than the

theoretical worst case, which is O(k2.726).

ASTRAL-III: improved running time

Feature 3. An A*-like algorithm is used to trim the dynamic programming

• Compute an upper-bound on the score of a cluster without “expanding” it

• If the upper bound is below the best existing score, don’t expand

12

New features (journal version)4. Restrict the search space to guarantee |X |=O(nk)

• Overall running time is O(D(nk)1.73)= O((nk)2.73)

• Does not impact accuracy

• Further improves speed

13

New features (journal version)4. Restrict the search space to guarantee |X |=O(nk)

• Overall running time is O(D(nk)1.73)= O((nk)2.73)

• Does not impact accuracy

• Further improves speed

5. The A*-like algorithm is further developed to compute . . even tighter upper bounds (complicated)

• Improves speed only empirically and slightly

13

Empirical running time

Empirical running time seems close to O((nk)2)

14

Zhang et al. Page 34 of 34

● ●

● ●

1.09

10k

20k

50k

100k

200k

500k

1M

200 500 1000 2000 5000 10000#Species

Size

of X

1.83

1

10

100

1000

200 500 1000 2000 5000 10000#Species

Run

ning

tim

e (m

inut

es)

Figure S8 Empirical running time of ASTRAL-III with n. Average running time is shown forASTRAL-III for datasets with varying n. Averages are over 20 replicates. One replicate of 2000species dataset could not finish in 2 days and is removed from the analysis. Note that thesedatasets have factors other than n that change as well (e.g., the amount of ILS, etc.). Thus, theserunning times should be treated as ball-park estimates. Finally, we note that on the 10,000dataset, we have only 2 replicates and not 20.

Zhang et al. Page 13 of 34

2.272.09

2.292.06

500 1500

28 29 210 211 212 213 214 28 29 210 211 212 213 214

20

25

210

#Genes

Run

ning

tim

e (m

inut

es)

ASTRAL2

ASTRAL3

Figure 5 Running time versus k. Average running times (4 replicates) are shown for ASTRAL-IIand ASTRAL-III on the avian dataset with 500bp or 1500bp alignments with varying numbers ofgens (k), shown in log scale (see Fig. S2 for normal scale). A line is fit to the data points in thelog/log space and line slopes are shown. ASTRAL-II did not finish on 2

14 genes in 48 hour.

branches has minimal impact on the discordance (eight discordant branches with

binned MP-EST instead of nine). However, contracting low support branches with

3%–33% thresholds dramatically reduces the discordance with the reference tree (2,

2, 4, 2, 3, and 3 discordant branches, respectively, for 3%, 5%, 7%, 10%, 20%, and

33%). Three thresholds (3%, 5%, and 10%) produce an identical tree (Fig. 4d). The

remaining di↵erences are among the branches that are deemed unresolved by Jarvis

et al. and change among the reference trees as well [5]. Contracting at 50% and 75%

thresholds, however, increases discordance to five and six branches, respectively.

Thus, consistent with simulations, contracting very low support branches seems

to produce the best results, when judged by similarity with the reference trees. To

summarize, ASTRAL-III obtained on unbinned but collapsed gene trees agreed with

all major relations in Jarvis et al., including the novel Columbea group, whereas

the unresolved tree missed important clades (Fig. 4).

3.2 RQ2: Running time improvements

We study the improvements in running time as various parameters change.

3.2.1 Varying the number of genes (k)

We compare ASTRAL-III to ASTRAL-II on the avian simulated dataset, changing

the number of genes from 28 to 214 and forcing X to be the same for both versions to

enable comparing impacts of improved weight calculation. We allow each replicate

run to take up to two days. ASTRAL-II is not able to finish on the dataset with

k = 214, while ASTRAL-III finishes on all conditions. ASTRAL-III improves the

running time over ASTRAL-II and the extent of the improvement depends on k

(Fig. 5). With 1000 genes or more, there is at least a 2.1X improvement. With

213 genes, the largest value where both versions could run, ASTRAL-III finishes

on average 3.2 times faster than ASTRAL-II (234 versus 758 minutes). Moreover,

fitting a line to the average running time in the log-log scale graph reveals that

on this dataset, the running time of ASTRAL-III on average grows as O(k2.08),

which is better than that of ASTRAL-II at O(k2.28), and both are better than the

theoretical worst case, which is O(k2.726).

Zhang et al. Page 13 of 34

2.272.09

2.292.06

500 1500

28 29 210 211 212 213 214 28 29 210 211 212 213 214

20

25

210

#Genes

Run

ning

tim

e (m

inut

es)

ASTRAL2

ASTRAL3

Figure 5 Running time versus k. Average running times (4 replicates) are shown for ASTRAL-IIand ASTRAL-III on the avian dataset with 500bp or 1500bp alignments with varying numbers ofgens (k), shown in log scale (see Fig. S2 for normal scale). A line is fit to the data points in thelog/log space and line slopes are shown. ASTRAL-II did not finish on 2

14 genes in 48 hour.

branches has minimal impact on the discordance (eight discordant branches with

binned MP-EST instead of nine). However, contracting low support branches with

3%–33% thresholds dramatically reduces the discordance with the reference tree (2,

2, 4, 2, 3, and 3 discordant branches, respectively, for 3%, 5%, 7%, 10%, 20%, and

33%). Three thresholds (3%, 5%, and 10%) produce an identical tree (Fig. 4d). The

remaining di↵erences are among the branches that are deemed unresolved by Jarvis

et al. and change among the reference trees as well [5]. Contracting at 50% and 75%

thresholds, however, increases discordance to five and six branches, respectively.

Thus, consistent with simulations, contracting very low support branches seems

to produce the best results, when judged by similarity with the reference trees. To

summarize, ASTRAL-III obtained on unbinned but collapsed gene trees agreed with

all major relations in Jarvis et al., including the novel Columbea group, whereas

the unresolved tree missed important clades (Fig. 4).

3.2 RQ2: Running time improvements

We study the improvements in running time as various parameters change.

3.2.1 Varying the number of genes (k)

We compare ASTRAL-III to ASTRAL-II on the avian simulated dataset, changing

the number of genes from 28 to 214 and forcing X to be the same for both versions to

enable comparing impacts of improved weight calculation. We allow each replicate

run to take up to two days. ASTRAL-II is not able to finish on the dataset with

k = 214, while ASTRAL-III finishes on all conditions. ASTRAL-III improves the

running time over ASTRAL-II and the extent of the improvement depends on k

(Fig. 5). With 1000 genes or more, there is at least a 2.1X improvement. With

213 genes, the largest value where both versions could run, ASTRAL-III finishes

on average 3.2 times faster than ASTRAL-II (234 versus 758 minutes). Moreover,

fitting a line to the average running time in the log-log scale graph reveals that

on this dataset, the running time of ASTRAL-III on average grows as O(k2.08),

which is better than that of ASTRAL-II at O(k2.28), and both are better than the

theoretical worst case, which is O(k2.726).

Zhang et al. Page 13 of 34

2.272.09

2.292.06

500 1500

28 29 210 211 212 213 214 28 29 210 211 212 213 214

20

25

210

#Genes

Run

ning

tim

e (m

inut

es)

ASTRAL2

ASTRAL3

Figure 5 Running time versus k. Average running times (4 replicates) are shown for ASTRAL-IIand ASTRAL-III on the avian dataset with 500bp or 1500bp alignments with varying numbers ofgens (k), shown in log scale (see Fig. S2 for normal scale). A line is fit to the data points in thelog/log space and line slopes are shown. ASTRAL-II did not finish on 2

14 genes in 48 hour.

branches has minimal impact on the discordance (eight discordant branches with

binned MP-EST instead of nine). However, contracting low support branches with

3%–33% thresholds dramatically reduces the discordance with the reference tree (2,

2, 4, 2, 3, and 3 discordant branches, respectively, for 3%, 5%, 7%, 10%, 20%, and

33%). Three thresholds (3%, 5%, and 10%) produce an identical tree (Fig. 4d). The

remaining di↵erences are among the branches that are deemed unresolved by Jarvis

et al. and change among the reference trees as well [5]. Contracting at 50% and 75%

thresholds, however, increases discordance to five and six branches, respectively.

Thus, consistent with simulations, contracting very low support branches seems

to produce the best results, when judged by similarity with the reference trees. To

summarize, ASTRAL-III obtained on unbinned but collapsed gene trees agreed with

all major relations in Jarvis et al., including the novel Columbea group, whereas

the unresolved tree missed important clades (Fig. 4).

3.2 RQ2: Running time improvements

We study the improvements in running time as various parameters change.

3.2.1 Varying the number of genes (k)

We compare ASTRAL-III to ASTRAL-II on the avian simulated dataset, changing

the number of genes from 28 to 214 and forcing X to be the same for both versions to

enable comparing impacts of improved weight calculation. We allow each replicate

run to take up to two days. ASTRAL-II is not able to finish on the dataset with

k = 214, while ASTRAL-III finishes on all conditions. ASTRAL-III improves the

running time over ASTRAL-II and the extent of the improvement depends on k

(Fig. 5). With 1000 genes or more, there is at least a 2.1X improvement. With

213 genes, the largest value where both versions could run, ASTRAL-III finishes

on average 3.2 times faster than ASTRAL-II (234 versus 758 minutes). Moreover,

fitting a line to the average running time in the log-log scale graph reveals that

on this dataset, the running time of ASTRAL-III on average grows as O(k2.08),

which is better than that of ASTRAL-II at O(k2.28), and both are better than the

theoretical worst case, which is O(k2.726).

Low support branches• Does it help to contract

branches with low support?

• Yes, but only for very low supports

15

Simulations: 100 taxa, simphy, ILS: around 46% true discordance FastTree, support from bootstrapping

genes

Low support branches• Does it help to contract

branches with low support?

• Yes, but only for very low supports

• Mostly helps in the presence of low support gene trees

15

Simulations: 100 taxa, simphy, ILS: around 46% true discordance FastTree, support from bootstrapping

genes

Low support branches• Does it help to contract

branches with low support?

• Yes, but only for very low supports

• Mostly helps in the presence of low support gene trees

• More genes allows for more filtering

15

Simulations: 100 taxa, simphy, ILS: around 46% true discordance FastTree, support from bootstrapping

genes

Avian biological dataset

16

Zhang et al. Page 12 of 34

Downy woodpecker

Yellow-throated sandgrouse

Common cuckoo

Carmine bee-eater

Anna's hummingbird

MacQueen's bustard

Great-crested grebeTurkey

Golden-collared manakin

White-tailed tropicbird

Cuckoo-roller

Red-legged seriema

Rhinoceros hornbill

Zebra finch

Common ostrich

Adelie penguin

Kea

Medium ground-finch

Peregrine falcon

American flamingo

Budgerigar

Little egret

White-throated tinamou

Bald eagle

Barn owl

Hoatzin

Turkey vulture

Killdeer

Dalmatian pelican

Bar-tailed trogon

Chuck-will's-widow

Crested ibis

Pekin duck

Pigeon

Speckled mousebird

Grey-crowned crane

Red-throated loon

Chicken

Sunbittern

Great cormorant

Red-crested turaco

Rifleman

Brown mesite

Northern fulmarEmperor penguin

American crow

Chimney swift

White-tailed eagle

50

88

97

59

80

91

97

99

58

Columbea

Turkey vulture

Great-crested grebe

Common cuckoo

Red-crested turaco

White-throated tinamou

Speckled mousebird

Adelie penguin

Anna's hummingbird

Pigeon

Dalmatian pelican

American flamingo

Great cormorant

Grey-crowned crane

Medium ground-finch

Red-throated loon

Kea

Cuckoo-roller

Hoatzin

Peregrine falcon

White-tailed tropicbird

Brown mesite

Bar-tailed trogon

Bald eagle

Rifleman

White-tailed eagle

Northern fulmar

Chuck-will's-widow

Turkey

Carmine bee-eater

Zebra finch

Little egret

Pekin duck

Crested ibis

MacQueen's bustard

Golden-collared manakin

Common ostrich

Rhinoceros hornbill

Emperor penguin

Killdeer

Chimney swift

Red-legged seriema

Downy woodpecker

Sunbittern

Chicken

Budgerigar

American crow

Yellow-throated sandgrouse

Barn owl

99

97

89

88

97

9860

88

90

98

56

Sunbittern

KeaPeregrine falcon

Killdeer

Budgerigar

Zebra finch

Rifleman

Bald eagle

Speckled mousebirdBarn owl

Golden-collared manakin

Little egret

Northern fulmar

Turkey vulture

Carmine bee-eater

Hoatzin

Rhinoceros hornbill

Chimney swift

American flamingo

Red-legged seriema

Medium ground-finch

Red-crested turaco

Common ostrich

Chuck-will's-widow

Great-crested grebe

Yellow-throated sandgrousePigeon

ChickenTurkey

American crow

White-throated tinamou

MacQueen's bustard

White-tailed eagle

Anna's hummingbird

Cuckoo-roller

Emperor penguin

Bar-tailed trogon

Great cormorant

Grey-crowned crane

White-tailed tropicbirdRed-throated loon

Brown mesite

Downy woodpecker

Adelie penguin

Pekin duck

Crested ibisDalmatian pelican

Common cuckoo91

96

72

91

84

55

9691

70

Speckled mousebird

Little egret

Brown mesite

Pekin duck

Barn owl

Bar-tailed trogon

Bald eagle

Anna's hummingbird

Downy woodpecker

Chicken

Red-legged seriema

Zebra finch

MacQueen's bustard

Northern fulmar

Rifleman

American crow

Peregrine falcon

Rhinoceros hornbill

Emperor penguin

Cuckoo-roller

Red-throated loon

Great cormorant

Common ostrich

Chuck-will's-widow

Adelie penguin

White-tailed tropicbird

Red-crested turaco

Golden-collared manakin

White-tailed eagleTurkey vulture

Budgerigar

Dalmatian pelican

Pigeon

Carmine bee-eater

Turkey

Medium ground-finch

Grey-crowned crane

Great-crested grebe

Killdeer

Chimney swift

Sunbittern

Hoatzin

White-throated tinamou

Common cuckoo

Kea

Crested ibis

Yellow-throated sandgrouse

American flamingo

91,90,70

98,99,100

84,92,95

92,93,85

87,91,8552,59,74

80,73,70

a) Binned MP-EST(coalescent-basedpublished tree)

b) ExaML concatenation(published tree)

c) Unbinned ASTRAL - no contraction

d) Unbinned ASTRAL -3%, 5%, 10% contraction

Figure 4 Avian dataset with 14,446 genes. Shown are reference trees from the original paper [5]using the coalescent-based binning (a) and concatenation (b), and two new trees usingASTRAL-III with no contraction (c) and with contraction with 3%, 5%, and 10% thresholds (d).Support values (bootstrap for a,b and local posterior probability for c,d) shown for all branchesexcept those with full support; in (d), support is shown for 3%, 5%, and 10%, respectively.Branches conflicting with the reference coalescent-based tree are shown as dotted red lines.

MP-EST run on binned gene trees (i.e., binned MP-EST) produced results [5, 33]

that were largely congruent with the concatenation using ExaML [45] and di↵ered

in only five branches with low support (Fig. 4ab); both trees were used as the

reference [5]. Here, we test if simply contracting low support gene tree branches

and using ASTRAL-III produces a tree that is congruent with these reference trees.

Similar to MP-EST, when ATRAL-III is run on 14,446 gene trees with no con-

traction, the results di↵er in nine and 11 branches, respectively, with respect to

the reference binned MP-EST and concatenation trees (Fig. 4c). Moreover, this

tree contradicts some strong results from the avian analyses (e.g., not recovering

the Columbea group [5]). ASTRAL-III with no contraction finishes in 32 hours,

but with contraction, depending on the threshold, it takes 3 to 84 hours (> 50

hours for 0% – 20% thresholds and < 26 hours for 33% – 75%). Contracting 0%

14K gene trees with very low gene tree resolution (25% mean BS)

Avian biological dataset

16

Zhang et al. Page 12 of 34

Downy woodpecker

Yellow-throated sandgrouse

Common cuckoo

Carmine bee-eater

Anna's hummingbird

MacQueen's bustard

Great-crested grebeTurkey

Golden-collared manakin

White-tailed tropicbird

Cuckoo-roller

Red-legged seriema

Rhinoceros hornbill

Zebra finch

Common ostrich

Adelie penguin

Kea

Medium ground-finch

Peregrine falcon

American flamingo

Budgerigar

Little egret

White-throated tinamou

Bald eagle

Barn owl

Hoatzin

Turkey vulture

Killdeer

Dalmatian pelican

Bar-tailed trogon

Chuck-will's-widow

Crested ibis

Pekin duck

Pigeon

Speckled mousebird

Grey-crowned crane

Red-throated loon

Chicken

Sunbittern

Great cormorant

Red-crested turaco

Rifleman

Brown mesite

Northern fulmarEmperor penguin

American crow

Chimney swift

White-tailed eagle

50

88

97

59

80

91

97

99

58

Columbea

Turkey vulture

Great-crested grebe

Common cuckoo

Red-crested turaco

White-throated tinamou

Speckled mousebird

Adelie penguin

Anna's hummingbird

Pigeon

Dalmatian pelican

American flamingo

Great cormorant

Grey-crowned crane

Medium ground-finch

Red-throated loon

Kea

Cuckoo-roller

Hoatzin

Peregrine falcon

White-tailed tropicbird

Brown mesite

Bar-tailed trogon

Bald eagle

Rifleman

White-tailed eagle

Northern fulmar

Chuck-will's-widow

Turkey

Carmine bee-eater

Zebra finch

Little egret

Pekin duck

Crested ibis

MacQueen's bustard

Golden-collared manakin

Common ostrich

Rhinoceros hornbill

Emperor penguin

Killdeer

Chimney swift

Red-legged seriema

Downy woodpecker

Sunbittern

Chicken

Budgerigar

American crow

Yellow-throated sandgrouse

Barn owl

99

97

89

88

97

9860

88

90

98

56

Sunbittern

KeaPeregrine falcon

Killdeer

Budgerigar

Zebra finch

Rifleman

Bald eagle

Speckled mousebirdBarn owl

Golden-collared manakin

Little egret

Northern fulmar

Turkey vulture

Carmine bee-eater

Hoatzin

Rhinoceros hornbill

Chimney swift

American flamingo

Red-legged seriema

Medium ground-finch

Red-crested turaco

Common ostrich

Chuck-will's-widow

Great-crested grebe

Yellow-throated sandgrousePigeon

ChickenTurkey

American crow

White-throated tinamou

MacQueen's bustard

White-tailed eagle

Anna's hummingbird

Cuckoo-roller

Emperor penguin

Bar-tailed trogon

Great cormorant

Grey-crowned crane

White-tailed tropicbirdRed-throated loon

Brown mesite

Downy woodpecker

Adelie penguin

Pekin duck

Crested ibisDalmatian pelican

Common cuckoo91

96

72

91

84

55

9691

70

Speckled mousebird

Little egret

Brown mesite

Pekin duck

Barn owl

Bar-tailed trogon

Bald eagle

Anna's hummingbird

Downy woodpecker

Chicken

Red-legged seriema

Zebra finch

MacQueen's bustard

Northern fulmar

Rifleman

American crow

Peregrine falcon

Rhinoceros hornbill

Emperor penguin

Cuckoo-roller

Red-throated loon

Great cormorant

Common ostrich

Chuck-will's-widow

Adelie penguin

White-tailed tropicbird

Red-crested turaco

Golden-collared manakin

White-tailed eagleTurkey vulture

Budgerigar

Dalmatian pelican

Pigeon

Carmine bee-eater

Turkey

Medium ground-finch

Grey-crowned crane

Great-crested grebe

Killdeer

Chimney swift

Sunbittern

Hoatzin

White-throated tinamou

Common cuckoo

Kea

Crested ibis

Yellow-throated sandgrouse

American flamingo

91,90,70

98,99,100

84,92,95

92,93,85

87,91,8552,59,74

80,73,70

a) Binned MP-EST(coalescent-basedpublished tree)

b) ExaML concatenation(published tree)

c) Unbinned ASTRAL - no contraction

d) Unbinned ASTRAL -3%, 5%, 10% contraction

Figure 4 Avian dataset with 14,446 genes. Shown are reference trees from the original paper [5]using the coalescent-based binning (a) and concatenation (b), and two new trees usingASTRAL-III with no contraction (c) and with contraction with 3%, 5%, and 10% thresholds (d).Support values (bootstrap for a,b and local posterior probability for c,d) shown for all branchesexcept those with full support; in (d), support is shown for 3%, 5%, and 10%, respectively.Branches conflicting with the reference coalescent-based tree are shown as dotted red lines.

MP-EST run on binned gene trees (i.e., binned MP-EST) produced results [5, 33]

that were largely congruent with the concatenation using ExaML [45] and di↵ered

in only five branches with low support (Fig. 4ab); both trees were used as the

reference [5]. Here, we test if simply contracting low support gene tree branches

and using ASTRAL-III produces a tree that is congruent with these reference trees.

Similar to MP-EST, when ATRAL-III is run on 14,446 gene trees with no con-

traction, the results di↵er in nine and 11 branches, respectively, with respect to

the reference binned MP-EST and concatenation trees (Fig. 4c). Moreover, this

tree contradicts some strong results from the avian analyses (e.g., not recovering

the Columbea group [5]). ASTRAL-III with no contraction finishes in 32 hours,

but with contraction, depending on the threshold, it takes 3 to 84 hours (> 50

hours for 0% – 20% thresholds and < 26 hours for 33% – 75%). Contracting 0%

Zhang et al. Page 12 of 34

Downy woodpecker

Yellow-throated sandgrouse

Common cuckoo

Carmine bee-eater

Anna's hummingbird

MacQueen's bustard

Great-crested grebeTurkey

Golden-collared manakin

White-tailed tropicbird

Cuckoo-roller

Red-legged seriema

Rhinoceros hornbill

Zebra finch

Common ostrich

Adelie penguin

Kea

Medium ground-finch

Peregrine falcon

American flamingo

Budgerigar

Little egret

White-throated tinamou

Bald eagle

Barn owl

Hoatzin

Turkey vulture

Killdeer

Dalmatian pelican

Bar-tailed trogon

Chuck-will's-widow

Crested ibis

Pekin duck

Pigeon

Speckled mousebird

Grey-crowned crane

Red-throated loon

Chicken

Sunbittern

Great cormorant

Red-crested turaco

Rifleman

Brown mesite

Northern fulmarEmperor penguin

American crow

Chimney swift

White-tailed eagle

50

88

97

59

80

91

97

99

58

Columbea

Turkey vulture

Great-crested grebe

Common cuckoo

Red-crested turaco

White-throated tinamou

Speckled mousebird

Adelie penguin

Anna's hummingbird

Pigeon

Dalmatian pelican

American flamingo

Great cormorant

Grey-crowned crane

Medium ground-finch

Red-throated loon

Kea

Cuckoo-roller

Hoatzin

Peregrine falcon

White-tailed tropicbird

Brown mesite

Bar-tailed trogon

Bald eagle

Rifleman

White-tailed eagle

Northern fulmar

Chuck-will's-widow

Turkey

Carmine bee-eater

Zebra finch

Little egret

Pekin duck

Crested ibis

MacQueen's bustard

Golden-collared manakin

Common ostrich

Rhinoceros hornbill

Emperor penguin

Killdeer

Chimney swift

Red-legged seriema

Downy woodpecker

Sunbittern

Chicken

Budgerigar

American crow

Yellow-throated sandgrouse

Barn owl

99

97

89

88

97

9860

88

90

98

56

Sunbittern

KeaPeregrine falcon

Killdeer

Budgerigar

Zebra finch

Rifleman

Bald eagle

Speckled mousebirdBarn owl

Golden-collared manakin

Little egret

Northern fulmar

Turkey vulture

Carmine bee-eater

Hoatzin

Rhinoceros hornbill

Chimney swift

American flamingo

Red-legged seriema

Medium ground-finch

Red-crested turaco

Common ostrich

Chuck-will's-widow

Great-crested grebe

Yellow-throated sandgrousePigeon

ChickenTurkey

American crow

White-throated tinamou

MacQueen's bustard

White-tailed eagle

Anna's hummingbird

Cuckoo-roller

Emperor penguin

Bar-tailed trogon

Great cormorant

Grey-crowned crane

White-tailed tropicbirdRed-throated loon

Brown mesite

Downy woodpecker

Adelie penguin

Pekin duck

Crested ibisDalmatian pelican

Common cuckoo91

96

72

91

84

55

9691

70

Speckled mousebird

Little egret

Brown mesite

Pekin duck

Barn owl

Bar-tailed trogon

Bald eagle

Anna's hummingbird

Downy woodpecker

Chicken

Red-legged seriema

Zebra finch

MacQueen's bustard

Northern fulmar

Rifleman

American crow

Peregrine falcon

Rhinoceros hornbill

Emperor penguin

Cuckoo-roller

Red-throated loon

Great cormorant

Common ostrich

Chuck-will's-widow

Adelie penguin

White-tailed tropicbird

Red-crested turaco

Golden-collared manakin

White-tailed eagleTurkey vulture

Budgerigar

Dalmatian pelican

Pigeon

Carmine bee-eater

Turkey

Medium ground-finch

Grey-crowned crane

Great-crested grebe

Killdeer

Chimney swift

Sunbittern

Hoatzin

White-throated tinamou

Common cuckoo

Kea

Crested ibis

Yellow-throated sandgrouse

American flamingo

91,90,70

98,99,100

84,92,95

92,93,85

87,91,8552,59,74

80,73,70

a) Binned MP-EST(coalescent-basedpublished tree)

b) ExaML concatenation(published tree)

c) Unbinned ASTRAL - no contraction

d) Unbinned ASTRAL -3%, 5%, 10% contraction

Figure 4 Avian dataset with 14,446 genes. Shown are reference trees from the original paper [5]using the coalescent-based binning (a) and concatenation (b), and two new trees usingASTRAL-III with no contraction (c) and with contraction with 3%, 5%, and 10% thresholds (d).Support values (bootstrap for a,b and local posterior probability for c,d) shown for all branchesexcept those with full support; in (d), support is shown for 3%, 5%, and 10%, respectively.Branches conflicting with the reference coalescent-based tree are shown as dotted red lines.

MP-EST run on binned gene trees (i.e., binned MP-EST) produced results [5, 33]

that were largely congruent with the concatenation using ExaML [45] and di↵ered

in only five branches with low support (Fig. 4ab); both trees were used as the

reference [5]. Here, we test if simply contracting low support gene tree branches

and using ASTRAL-III produces a tree that is congruent with these reference trees.

Similar to MP-EST, when ATRAL-III is run on 14,446 gene trees with no con-

traction, the results di↵er in nine and 11 branches, respectively, with respect to

the reference binned MP-EST and concatenation trees (Fig. 4c). Moreover, this

tree contradicts some strong results from the avian analyses (e.g., not recovering

the Columbea group [5]). ASTRAL-III with no contraction finishes in 32 hours,

but with contraction, depending on the threshold, it takes 3 to 84 hours (> 50

hours for 0% – 20% thresholds and < 26 hours for 33% – 75%). Contracting 0%

14K gene trees with very low gene tree resolution (25% mean BS)

Avian biological dataset

16

Zhang et al. Page 12 of 34

Downy woodpecker

Yellow-throated sandgrouse

Common cuckoo

Carmine bee-eater

Anna's hummingbird

MacQueen's bustard

Great-crested grebeTurkey

Golden-collared manakin

White-tailed tropicbird

Cuckoo-roller

Red-legged seriema

Rhinoceros hornbill

Zebra finch

Common ostrich

Adelie penguin

Kea

Medium ground-finch

Peregrine falcon

American flamingo

Budgerigar

Little egret

White-throated tinamou

Bald eagle

Barn owl

Hoatzin

Turkey vulture

Killdeer

Dalmatian pelican

Bar-tailed trogon

Chuck-will's-widow

Crested ibis

Pekin duck

Pigeon

Speckled mousebird

Grey-crowned crane

Red-throated loon

Chicken

Sunbittern

Great cormorant

Red-crested turaco

Rifleman

Brown mesite

Northern fulmarEmperor penguin

American crow

Chimney swift

White-tailed eagle

50

88

97

59

80

91

97

99

58

Columbea

Turkey vulture

Great-crested grebe

Common cuckoo

Red-crested turaco

White-throated tinamou

Speckled mousebird

Adelie penguin

Anna's hummingbird

Pigeon

Dalmatian pelican

American flamingo

Great cormorant

Grey-crowned crane

Medium ground-finch

Red-throated loon

Kea

Cuckoo-roller

Hoatzin

Peregrine falcon

White-tailed tropicbird

Brown mesite

Bar-tailed trogon

Bald eagle

Rifleman

White-tailed eagle

Northern fulmar

Chuck-will's-widow

Turkey

Carmine bee-eater

Zebra finch

Little egret

Pekin duck

Crested ibis

MacQueen's bustard

Golden-collared manakin

Common ostrich

Rhinoceros hornbill

Emperor penguin

Killdeer

Chimney swift

Red-legged seriema

Downy woodpecker

Sunbittern

Chicken

Budgerigar

American crow

Yellow-throated sandgrouse

Barn owl

99

97

89

88

97

9860

88

90

98

56

Sunbittern

KeaPeregrine falcon

Killdeer

Budgerigar

Zebra finch

Rifleman

Bald eagle

Speckled mousebirdBarn owl

Golden-collared manakin

Little egret

Northern fulmar

Turkey vulture

Carmine bee-eater

Hoatzin

Rhinoceros hornbill

Chimney swift

American flamingo

Red-legged seriema

Medium ground-finch

Red-crested turaco

Common ostrich

Chuck-will's-widow

Great-crested grebe

Yellow-throated sandgrousePigeon

ChickenTurkey

American crow

White-throated tinamou

MacQueen's bustard

White-tailed eagle

Anna's hummingbird

Cuckoo-roller

Emperor penguin

Bar-tailed trogon

Great cormorant

Grey-crowned crane

White-tailed tropicbirdRed-throated loon

Brown mesite

Downy woodpecker

Adelie penguin

Pekin duck

Crested ibisDalmatian pelican

Common cuckoo91

96

72

91

84

55

9691

70

Speckled mousebird

Little egret

Brown mesite

Pekin duck

Barn owl

Bar-tailed trogon

Bald eagle

Anna's hummingbird

Downy woodpecker

Chicken

Red-legged seriema

Zebra finch

MacQueen's bustard

Northern fulmar

Rifleman

American crow

Peregrine falcon

Rhinoceros hornbill

Emperor penguin

Cuckoo-roller

Red-throated loon

Great cormorant

Common ostrich

Chuck-will's-widow

Adelie penguin

White-tailed tropicbird

Red-crested turaco

Golden-collared manakin

White-tailed eagleTurkey vulture

Budgerigar

Dalmatian pelican

Pigeon

Carmine bee-eater

Turkey

Medium ground-finch

Grey-crowned crane

Great-crested grebe

Killdeer

Chimney swift

Sunbittern

Hoatzin

White-throated tinamou

Common cuckoo

Kea

Crested ibis

Yellow-throated sandgrouse

American flamingo

91,90,70

98,99,100

84,92,95

92,93,85

87,91,8552,59,74

80,73,70

a) Binned MP-EST(coalescent-basedpublished tree)

b) ExaML concatenation(published tree)

c) Unbinned ASTRAL - no contraction

d) Unbinned ASTRAL -3%, 5%, 10% contraction

Figure 4 Avian dataset with 14,446 genes. Shown are reference trees from the original paper [5]using the coalescent-based binning (a) and concatenation (b), and two new trees usingASTRAL-III with no contraction (c) and with contraction with 3%, 5%, and 10% thresholds (d).Support values (bootstrap for a,b and local posterior probability for c,d) shown for all branchesexcept those with full support; in (d), support is shown for 3%, 5%, and 10%, respectively.Branches conflicting with the reference coalescent-based tree are shown as dotted red lines.

MP-EST run on binned gene trees (i.e., binned MP-EST) produced results [5, 33]

that were largely congruent with the concatenation using ExaML [45] and di↵ered

in only five branches with low support (Fig. 4ab); both trees were used as the

reference [5]. Here, we test if simply contracting low support gene tree branches

and using ASTRAL-III produces a tree that is congruent with these reference trees.

Similar to MP-EST, when ATRAL-III is run on 14,446 gene trees with no con-

traction, the results di↵er in nine and 11 branches, respectively, with respect to

the reference binned MP-EST and concatenation trees (Fig. 4c). Moreover, this

tree contradicts some strong results from the avian analyses (e.g., not recovering

the Columbea group [5]). ASTRAL-III with no contraction finishes in 32 hours,

but with contraction, depending on the threshold, it takes 3 to 84 hours (> 50

hours for 0% – 20% thresholds and < 26 hours for 33% – 75%). Contracting 0%

Zhang et al. Page 12 of 34

Downy woodpecker

Yellow-throated sandgrouse

Common cuckoo

Carmine bee-eater

Anna's hummingbird

MacQueen's bustard

Great-crested grebeTurkey

Golden-collared manakin

White-tailed tropicbird

Cuckoo-roller

Red-legged seriema

Rhinoceros hornbill

Zebra finch

Common ostrich

Adelie penguin

Kea

Medium ground-finch

Peregrine falcon

American flamingo

Budgerigar

Little egret

White-throated tinamou

Bald eagle

Barn owl

Hoatzin

Turkey vulture

Killdeer

Dalmatian pelican

Bar-tailed trogon

Chuck-will's-widow

Crested ibis

Pekin duck

Pigeon

Speckled mousebird

Grey-crowned crane

Red-throated loon

Chicken

Sunbittern

Great cormorant

Red-crested turaco

Rifleman

Brown mesite

Northern fulmarEmperor penguin

American crow

Chimney swift

White-tailed eagle

50

88

97

59

80

91

97

99

58

Columbea

Turkey vulture

Great-crested grebe

Common cuckoo

Red-crested turaco

White-throated tinamou

Speckled mousebird

Adelie penguin

Anna's hummingbird

Pigeon

Dalmatian pelican

American flamingo

Great cormorant

Grey-crowned crane

Medium ground-finch

Red-throated loon

Kea

Cuckoo-roller

Hoatzin

Peregrine falcon

White-tailed tropicbird

Brown mesite

Bar-tailed trogon

Bald eagle

Rifleman

White-tailed eagle

Northern fulmar

Chuck-will's-widow

Turkey

Carmine bee-eater

Zebra finch

Little egret

Pekin duck

Crested ibis

MacQueen's bustard

Golden-collared manakin

Common ostrich

Rhinoceros hornbill

Emperor penguin

Killdeer

Chimney swift

Red-legged seriema

Downy woodpecker

Sunbittern

Chicken

Budgerigar

American crow

Yellow-throated sandgrouse

Barn owl

99

97

89

88

97

9860

88

90

98

56

Sunbittern

KeaPeregrine falcon

Killdeer

Budgerigar

Zebra finch

Rifleman

Bald eagle

Speckled mousebirdBarn owl

Golden-collared manakin

Little egret

Northern fulmar

Turkey vulture

Carmine bee-eater

Hoatzin

Rhinoceros hornbill

Chimney swift

American flamingo

Red-legged seriema

Medium ground-finch

Red-crested turaco

Common ostrich

Chuck-will's-widow

Great-crested grebe

Yellow-throated sandgrousePigeon

ChickenTurkey

American crow

White-throated tinamou

MacQueen's bustard

White-tailed eagle

Anna's hummingbird

Cuckoo-roller

Emperor penguin

Bar-tailed trogon

Great cormorant

Grey-crowned crane

White-tailed tropicbirdRed-throated loon

Brown mesite

Downy woodpecker

Adelie penguin

Pekin duck

Crested ibisDalmatian pelican

Common cuckoo91

96

72

91

84

55

9691

70

Speckled mousebird

Little egret

Brown mesite

Pekin duck

Barn owl

Bar-tailed trogon

Bald eagle

Anna's hummingbird

Downy woodpecker

Chicken

Red-legged seriema

Zebra finch

MacQueen's bustard

Northern fulmar

Rifleman

American crow

Peregrine falcon

Rhinoceros hornbill

Emperor penguin

Cuckoo-roller

Red-throated loon

Great cormorant

Common ostrich

Chuck-will's-widow

Adelie penguin

White-tailed tropicbird

Red-crested turaco

Golden-collared manakin

White-tailed eagleTurkey vulture

Budgerigar

Dalmatian pelican

Pigeon

Carmine bee-eater

Turkey

Medium ground-finch

Grey-crowned crane

Great-crested grebe

Killdeer

Chimney swift

Sunbittern

Hoatzin

White-throated tinamou

Common cuckoo

Kea

Crested ibis

Yellow-throated sandgrouse

American flamingo

91,90,70

98,99,100

84,92,95

92,93,85

87,91,8552,59,74

80,73,70

a) Binned MP-EST(coalescent-basedpublished tree)

b) ExaML concatenation(published tree)

c) Unbinned ASTRAL - no contraction

d) Unbinned ASTRAL -3%, 5%, 10% contraction

Figure 4 Avian dataset with 14,446 genes. Shown are reference trees from the original paper [5]using the coalescent-based binning (a) and concatenation (b), and two new trees usingASTRAL-III with no contraction (c) and with contraction with 3%, 5%, and 10% thresholds (d).Support values (bootstrap for a,b and local posterior probability for c,d) shown for all branchesexcept those with full support; in (d), support is shown for 3%, 5%, and 10%, respectively.Branches conflicting with the reference coalescent-based tree are shown as dotted red lines.

MP-EST run on binned gene trees (i.e., binned MP-EST) produced results [5, 33]

that were largely congruent with the concatenation using ExaML [45] and di↵ered

in only five branches with low support (Fig. 4ab); both trees were used as the

reference [5]. Here, we test if simply contracting low support gene tree branches

and using ASTRAL-III produces a tree that is congruent with these reference trees.

Similar to MP-EST, when ATRAL-III is run on 14,446 gene trees with no con-

traction, the results di↵er in nine and 11 branches, respectively, with respect to

the reference binned MP-EST and concatenation trees (Fig. 4c). Moreover, this

tree contradicts some strong results from the avian analyses (e.g., not recovering

the Columbea group [5]). ASTRAL-III with no contraction finishes in 32 hours,

but with contraction, depending on the threshold, it takes 3 to 84 hours (> 50

hours for 0% – 20% thresholds and < 26 hours for 33% – 75%). Contracting 0%

Zhang et al. Page 12 of 34

Downy woodpecker

Yellow-throated sandgrouse

Common cuckoo

Carmine bee-eater

Anna's hummingbird

MacQueen's bustard

Great-crested grebeTurkey

Golden-collared manakin

White-tailed tropicbird

Cuckoo-roller

Red-legged seriema

Rhinoceros hornbill

Zebra finch

Common ostrich

Adelie penguin

Kea

Medium ground-finch

Peregrine falcon

American flamingo

Budgerigar

Little egret

White-throated tinamou

Bald eagle

Barn owl

Hoatzin

Turkey vulture

Killdeer

Dalmatian pelican

Bar-tailed trogon

Chuck-will's-widow

Crested ibis

Pekin duck

Pigeon

Speckled mousebird

Grey-crowned crane

Red-throated loon

Chicken

Sunbittern

Great cormorant

Red-crested turaco

Rifleman

Brown mesite

Northern fulmarEmperor penguin

American crow

Chimney swift

White-tailed eagle

50

88

97

59

80

91

97

99

58

Columbea

Turkey vulture

Great-crested grebe

Common cuckoo

Red-crested turaco

White-throated tinamou

Speckled mousebird

Adelie penguin

Anna's hummingbird

Pigeon

Dalmatian pelican

American flamingo

Great cormorant

Grey-crowned crane

Medium ground-finch

Red-throated loon

Kea

Cuckoo-roller

Hoatzin

Peregrine falcon

White-tailed tropicbird

Brown mesite

Bar-tailed trogon

Bald eagle

Rifleman

White-tailed eagle

Northern fulmar

Chuck-will's-widow

Turkey

Carmine bee-eater

Zebra finch

Little egret

Pekin duck

Crested ibis

MacQueen's bustard

Golden-collared manakin

Common ostrich

Rhinoceros hornbill

Emperor penguin

Killdeer

Chimney swift

Red-legged seriema

Downy woodpecker

Sunbittern

Chicken

Budgerigar

American crow

Yellow-throated sandgrouse

Barn owl

99

97

89

88

97

9860

88

90

98

56

Sunbittern

KeaPeregrine falcon

Killdeer

Budgerigar

Zebra finch

Rifleman

Bald eagle

Speckled mousebirdBarn owl

Golden-collared manakin

Little egret

Northern fulmar

Turkey vulture

Carmine bee-eater

Hoatzin

Rhinoceros hornbill

Chimney swift

American flamingo

Red-legged seriema

Medium ground-finch

Red-crested turaco

Common ostrich

Chuck-will's-widow

Great-crested grebe

Yellow-throated sandgrousePigeon

ChickenTurkey

American crow

White-throated tinamou

MacQueen's bustard

White-tailed eagle

Anna's hummingbird

Cuckoo-roller

Emperor penguin

Bar-tailed trogon

Great cormorant

Grey-crowned crane

White-tailed tropicbirdRed-throated loon

Brown mesite

Downy woodpecker

Adelie penguin

Pekin duck

Crested ibisDalmatian pelican

Common cuckoo91

96

72

91

84

55

9691

70

Speckled mousebird

Little egret

Brown mesite

Pekin duck

Barn owl

Bar-tailed trogon

Bald eagle

Anna's hummingbird

Downy woodpecker

Chicken

Red-legged seriema

Zebra finch

MacQueen's bustard

Northern fulmar

Rifleman

American crow

Peregrine falcon

Rhinoceros hornbill

Emperor penguin

Cuckoo-roller

Red-throated loon

Great cormorant

Common ostrich

Chuck-will's-widow

Adelie penguin

White-tailed tropicbird

Red-crested turaco

Golden-collared manakin

White-tailed eagleTurkey vulture

Budgerigar

Dalmatian pelican

Pigeon

Carmine bee-eater

Turkey

Medium ground-finch

Grey-crowned crane

Great-crested grebe

Killdeer

Chimney swift

Sunbittern

Hoatzin

White-throated tinamou

Common cuckoo

Kea

Crested ibis

Yellow-throated sandgrouse

American flamingo

91,90,70

98,99,100

84,92,95

92,93,85

87,91,8552,59,74

80,73,70

a) Binned MP-EST(coalescent-basedpublished tree)

b) ExaML concatenation(published tree)

c) Unbinned ASTRAL - no contraction

d) Unbinned ASTRAL -3%, 5%, 10% contraction

Figure 4 Avian dataset with 14,446 genes. Shown are reference trees from the original paper [5]using the coalescent-based binning (a) and concatenation (b), and two new trees usingASTRAL-III with no contraction (c) and with contraction with 3%, 5%, and 10% thresholds (d).Support values (bootstrap for a,b and local posterior probability for c,d) shown for all branchesexcept those with full support; in (d), support is shown for 3%, 5%, and 10%, respectively.Branches conflicting with the reference coalescent-based tree are shown as dotted red lines.

MP-EST run on binned gene trees (i.e., binned MP-EST) produced results [5, 33]

that were largely congruent with the concatenation using ExaML [45] and di↵ered

in only five branches with low support (Fig. 4ab); both trees were used as the

reference [5]. Here, we test if simply contracting low support gene tree branches

and using ASTRAL-III produces a tree that is congruent with these reference trees.

Similar to MP-EST, when ATRAL-III is run on 14,446 gene trees with no con-

traction, the results di↵er in nine and 11 branches, respectively, with respect to

the reference binned MP-EST and concatenation trees (Fig. 4c). Moreover, this

tree contradicts some strong results from the avian analyses (e.g., not recovering

the Columbea group [5]). ASTRAL-III with no contraction finishes in 32 hours,

but with contraction, depending on the threshold, it takes 3 to 84 hours (> 50

hours for 0% – 20% thresholds and < 26 hours for 33% – 75%). Contracting 0%

14K gene trees with very low gene tree resolution (25% mean BS)

Moving forward …• ASTRAL, like all other two-step approaches, is

sensitive to errors in the input gene trees

• What else can be done to reduce impacts of gene tree error?

17

Moving forward …• ASTRAL, like all other two-step approaches, is

sensitive to errors in the input gene trees

• What else can be done to reduce impacts of gene tree error?

• ASTRAL scales to 10K species. We have 90K bacterial genomes.

• Can we scale further? … divide-and-conquer …

17

Erfan  Sayyari Maryam RabieeChao ZhangTandy  Warnow

19

Zhang et al. Page 14 of 34

(a)

50 200 500 1000

non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75

2−12

2−10

2−8

2−6

contraction

Avg

weig

ht c

alcu

latio

n tim

e (s

econ

ds) 1600

200

ASTRAL−II

ASTRAL−III

(b)

● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ●

●● ● ● ●

●●

●●

● ● ● ● ● ●●

● ●

● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●●

●●

●● ● ● ●

●●

●●

● ● ● ● ●

●●

● ● ● ● ● ●●

●●

●● ● ● ● ●

●●

●●

●●

● ●●

● ● ● ● ● ●

● ● ● ● ●

● ●

50 200 500 1000

non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 750

25

50

75

100

contraction

Set X

clu

ster

size

(K)

Figure 6 Weight calculation and |X| on S100. Average and standard error of (a) the time ittakes to score a single tripartition using Eq. 3 and (b) search space size |X| are shown for bothASTRAL-II and ASTRAL-III on the S100 dataset. Running time is in log scale. We vary numbersof gene trees (boxes) and sequence length (200 and 1600). See Fig. S3 for similar patterns forwith 400 and 800bp alignments.

3.2.2 Running time for large polytomies

ASTRAL-III has a clear advantage compared to ASTRAL-II with respect to the

running time when gene trees include polytomies (Fig. 6a). Since ASTRAL-II and

ASTRAL-III can have a di↵erent set X, we show the running time per each weight

calculation (i.e., Eq. 3). As we contract low support branches and hence increase the

prevalence of polytomies, the weight calculation time quickly grows for ASTRAL-II,

whereas, in ASTRAL-III, the weight calculation time remains flat, or even decreases.

3.2.3 The search space

Comparing the size of the search space (|X|) between ATRAL-II and ASTRAL-

III shows that as intended, the search space is decreased in size for cases with no

polytomy but can increase in the presence of polytomies (Fig. 6b). With no contrac-

tion, on average, |X| is always smaller for ASTRAL-III than ASTRAL-II. With few

error-prone gene trees (50 gene trees from 200bp alignments), the search space has

reduced dramatically but with many genes or high-quality gene trees, the reduc-

tions are minimal. Moreover, the search space for gene trees estimated from short

alignments (e.g., 200bp) is several times larger than those based on longer align-

ments (e.g., 1600bp) for both methods. Contracting low support branches initially

increases the search space because ASTRAL-III adds multiple resolutions per poly-

tomy to X. Further contraction results in reductions in |X|, presumably because

many polytomies exist and they are resolved similarly inside ASTRAL-III.

Zhang et al. Page 14 of 34

(a)

50 200 500 1000

non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75

2−12

2−10

2−8

2−6

contraction

Avg

weig

ht c

alcu

latio

n tim

e (s

econ

ds) 1600

200

ASTRAL−II

ASTRAL−III

(b)

● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ●

●● ● ● ●

●●

●●

● ● ● ● ● ●●

● ●

● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●●

●●

●● ● ● ●

●●

●●

● ● ● ● ●

●●

● ● ● ● ● ●●

●●

●● ● ● ● ●

●●

●●

●●

● ●●

● ● ● ● ● ●

● ● ● ● ●

● ●

50 200 500 1000

non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 750

25

50

75

100

contraction

Set X

clu

ster

size

(K)

Figure 6 Weight calculation and |X| on S100. Average and standard error of (a) the time ittakes to score a single tripartition using Eq. 3 and (b) search space size |X| are shown for bothASTRAL-II and ASTRAL-III on the S100 dataset. Running time is in log scale. We vary numbersof gene trees (boxes) and sequence length (200 and 1600). See Fig. S3 for similar patterns forwith 400 and 800bp alignments.

3.2.2 Running time for large polytomies

ASTRAL-III has a clear advantage compared to ASTRAL-II with respect to the

running time when gene trees include polytomies (Fig. 6a). Since ASTRAL-II and

ASTRAL-III can have a di↵erent set X, we show the running time per each weight

calculation (i.e., Eq. 3). As we contract low support branches and hence increase the

prevalence of polytomies, the weight calculation time quickly grows for ASTRAL-II,

whereas, in ASTRAL-III, the weight calculation time remains flat, or even decreases.

3.2.3 The search space

Comparing the size of the search space (|X|) between ATRAL-II and ASTRAL-

III shows that as intended, the search space is decreased in size for cases with no

polytomy but can increase in the presence of polytomies (Fig. 6b). With no contrac-

tion, on average, |X| is always smaller for ASTRAL-III than ASTRAL-II. With few

error-prone gene trees (50 gene trees from 200bp alignments), the search space has

reduced dramatically but with many genes or high-quality gene trees, the reduc-

tions are minimal. Moreover, the search space for gene trees estimated from short

alignments (e.g., 200bp) is several times larger than those based on longer align-

ments (e.g., 1600bp) for both methods. Contracting low support branches initially

increases the search space because ASTRAL-III adds multiple resolutions per poly-

tomy to X. Further contraction results in reductions in |X|, presumably because

many polytomies exist and they are resolved similarly inside ASTRAL-III.

Zhang et al. Page 14 of 34

(a)

50 200 500 1000

non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75

2−12

2−10

2−8

2−6

contraction

Avg

weig

ht c

alcu

latio

n tim

e (s

econ

ds) 1600

200

ASTRAL−II

ASTRAL−III

(b)

● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ●

●● ● ● ●

●●

●●

● ● ● ● ● ●●

● ●

● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●●

●●

●● ● ● ●

●●

●●

● ● ● ● ●

●●

● ● ● ● ● ●●

●●

●● ● ● ● ●

●●

●●

●●

● ●●

● ● ● ● ● ●

● ● ● ● ●

● ●

50 200 500 1000

non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 750

25

50

75

100

contraction

Set X

clu

ster

size

(K)

Figure 6 Weight calculation and |X| on S100. Average and standard error of (a) the time ittakes to score a single tripartition using Eq. 3 and (b) search space size |X| are shown for bothASTRAL-II and ASTRAL-III on the S100 dataset. Running time is in log scale. We vary numbersof gene trees (boxes) and sequence length (200 and 1600). See Fig. S3 for similar patterns forwith 400 and 800bp alignments.

3.2.2 Running time for large polytomies

ASTRAL-III has a clear advantage compared to ASTRAL-II with respect to the

running time when gene trees include polytomies (Fig. 6a). Since ASTRAL-II and

ASTRAL-III can have a di↵erent set X, we show the running time per each weight

calculation (i.e., Eq. 3). As we contract low support branches and hence increase the

prevalence of polytomies, the weight calculation time quickly grows for ASTRAL-II,

whereas, in ASTRAL-III, the weight calculation time remains flat, or even decreases.

3.2.3 The search space

Comparing the size of the search space (|X|) between ATRAL-II and ASTRAL-

III shows that as intended, the search space is decreased in size for cases with no

polytomy but can increase in the presence of polytomies (Fig. 6b). With no contrac-

tion, on average, |X| is always smaller for ASTRAL-III than ASTRAL-II. With few

error-prone gene trees (50 gene trees from 200bp alignments), the search space has

reduced dramatically but with many genes or high-quality gene trees, the reduc-

tions are minimal. Moreover, the search space for gene trees estimated from short

alignments (e.g., 200bp) is several times larger than those based on longer align-

ments (e.g., 1600bp) for both methods. Contracting low support branches initially

increases the search space because ASTRAL-III adds multiple resolutions per poly-

tomy to X. Further contraction results in reductions in |X|, presumably because

many polytomies exist and they are resolved similarly inside ASTRAL-III.

Zhang et al. Page 14 of 34

(a)

50 200 500 1000

non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75

2−12

2−10

2−8

2−6

contraction

Avg

weig

ht c

alcu

latio

n tim

e (s

econ

ds) 1600

200

ASTRAL−II

ASTRAL−III

(b)

● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ●

●● ● ● ●

●●

●●

● ● ● ● ● ●●

● ●

● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●●

●●

●● ● ● ●

●●

●●

● ● ● ● ●

●●

● ● ● ● ● ●●

●●

●● ● ● ● ●

●●

●●

●●

● ●●

● ● ● ● ● ●

● ● ● ● ●

● ●

50 200 500 1000

non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 750

25

50

75

100

contraction

Set X

clu

ster

size

(K)

Figure 6 Weight calculation and |X| on S100. Average and standard error of (a) the time ittakes to score a single tripartition using Eq. 3 and (b) search space size |X| are shown for bothASTRAL-II and ASTRAL-III on the S100 dataset. Running time is in log scale. We vary numbersof gene trees (boxes) and sequence length (200 and 1600). See Fig. S3 for similar patterns forwith 400 and 800bp alignments.

3.2.2 Running time for large polytomies

ASTRAL-III has a clear advantage compared to ASTRAL-II with respect to the

running time when gene trees include polytomies (Fig. 6a). Since ASTRAL-II and

ASTRAL-III can have a di↵erent set X, we show the running time per each weight

calculation (i.e., Eq. 3). As we contract low support branches and hence increase the

prevalence of polytomies, the weight calculation time quickly grows for ASTRAL-II,

whereas, in ASTRAL-III, the weight calculation time remains flat, or even decreases.

3.2.3 The search space

Comparing the size of the search space (|X|) between ATRAL-II and ASTRAL-

III shows that as intended, the search space is decreased in size for cases with no

polytomy but can increase in the presence of polytomies (Fig. 6b). With no contrac-

tion, on average, |X| is always smaller for ASTRAL-III than ASTRAL-II. With few

error-prone gene trees (50 gene trees from 200bp alignments), the search space has

reduced dramatically but with many genes or high-quality gene trees, the reduc-

tions are minimal. Moreover, the search space for gene trees estimated from short

alignments (e.g., 200bp) is several times larger than those based on longer align-

ments (e.g., 1600bp) for both methods. Contracting low support branches initially

increases the search space because ASTRAL-III adds multiple resolutions per poly-

tomy to X. Further contraction results in reductions in |X|, presumably because

many polytomies exist and they are resolved similarly inside ASTRAL-III.

Further improvements• Use a polytree to overly all the

input gene trees into one data structure

• Allows us to spend time for each unique node in gene trees once

• O(|D|(nm)1.73)

• 3X running time improvement

20

unique nodes in gene trees = O(nm)

Dynamic programming

21

A L-A

L

Dynamic programming

21

A

A-A’A’A-A’A’A-A’A’

L-A

• Recursively break subsets of species into smaller subsets

L

Dynamic programming

21

A

A-A’A’A-A’A’A-A’A’

L-A

A-A’

A’

L-A

• Recursively break subsets of species into smaller subsets

• : Compare u against input gene trees and compute quartets from gene trees satisfied by u

u

L

w(u)

Dynamic programming

21

A

A-A’A’A-A’A’A-A’A’

L-A

A-A’

A’

L-A

• Recursively break subsets of species into smaller subsets

• : Compare u against input gene trees and compute quartets from gene trees satisfied by u

Exponential

u

L

w(u)

Constrained version

22

A

A-A’A’A-A’A’A-A’A’

L-A

L

A1

A4

A8

A5

A2

A6

A3A7

A9

A10

Constrained version

22

A

A-A’A’A-A’A’A-A’A’

L-A

L

A1

A4

A8

A5

A2

A6

A3A7

A9

A10

• Restrict “branches” in the species tree to a given constraint set X.

Asymptotic running time?• Simple discrete math question:

• X = a set of subsets of some set L. Y = { (a ,b) ∈ X | a∩b=∅ , a∪b ∈ X }

• Clearly, |Y| < |X|2

• What’s the maximum |Y| with respect to |X|?

23

Asymptotic running time?• Simple discrete math question:

• X = a set of subsets of some set L. Y = { (a ,b) ∈ X | a∩b=∅ , a∪b ∈ X }

• Clearly, |Y| < |X|2

• What’s the maximum |Y| with respect to |X|?

• Turns out to be rather challenging

• Daniel Kane and Terence Tao proved: |Y|=O(|X|1.73)

23