Tandy Warnow, The University of Illinois · tandy.cs.illinois.edu/warnow-imperial-v2.pdf

Genome-scale Estimation of the Tree of Life, Tandy Warnow, The University of Illinois

Post on 11-Jul-2020


TRANSCRIPT

Page 1: Tandy Warnow, The University of Illinois · tandy.cs.illinois.edu/warnow-imperial-v2.pdf

Genome-scale Estimation of the Tree of Life

Tandy Warnow, The University of Illinois

Page 2

Orangutan Gorilla Chimpanzee Human

From the Tree of the Life Website, University of Arizona

Phylogeny (evolutionary tree)

Page 3

Phylogenomics = Species trees from whole genomes

“Nothing in biology makes sense except in the light of evolution” - Dobzhansky

Page 4

The Tree of Life: Multiple Challenges

Scientific challenges:
• Ultra-large multiple-sequence alignment
• Alignment-free phylogeny estimation
• Supertree estimation
• Estimating species trees from many gene trees
• Genome rearrangement phylogeny
• Reticulate evolution
• Visualization of large trees and alignments
• Data mining techniques to explore multiple optima
• Theoretical guarantees under Markov models of evolution

Applications:
• metagenomics
• protein structure and function prediction
• trait evolution
• detection of co-evolution
• systems biology

Techniques:
• Graph theory (especially chordal graphs)
• Probability theory and statistics
• Hidden Markov models
• Combinatorial optimization
• Heuristics
• Supercomputing

Page 5

Phylogenomics

[Figure: multiple sequence alignments of loci labeled gene 1, gene 2, ..., gene 999, gene 1000, with one aligned sequence per species in each locus. The extracted alignment rows follow.]

ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG

CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G

AGCAGCATCGTG AGCAGC-TCGTG AGCAGC-TC-TG C-TA-CACGGTG

CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT

“gene” here refers to a portion of the genome (not a functional gene)

Orangutan

Gorilla

Chimpanzee

Human

I’ll use the term “gene” to refer to “c-genes”: recombination-free orthologous stretches of the genome

Page 6

Gene tree discordance

[Figure: two gene trees on Orangutan, Gorilla, Chimpanzee, and Human, labeled gene 1 and gene 1000, with different topologies]

Incomplete Lineage Sorting (ILS) is a dominant cause of gene tree heterogeneity

Page 7

Gene trees inside the species tree (Coalescent Process)

Present

Past

Courtesy James Degnan

Gorilla and Orangutan are not siblings in the species tree, but they are in the gene tree.

Page 8

Incomplete Lineage Sorting (ILS)

• Confounds phylogenetic analysis for many groups: Hominids, Birds, Yeast, Animals, Toads, Fish, Fungi, etc.

• There is substantial debate about how to analyze phylogenomic datasets in the presence of ILS, focused around statistical consistency guarantees (theory) and performance on data.

Page 9

Avian Phylogenomics Project

E Jarvis, HHMI
G Zhang, BGI
MTP Gilbert, Copenhagen
S. Mirarab, UT-Austin
Md. S. Bayzid, UT-Austin
T. Warnow, UT-Austin
Plus many many other people…

• Approx. 50 species, whole genomes, 14,000 loci
• Jarvis, Mirarab, et al., Science 2014

Major challenges:
• Concatenation analysis took >250 CPU years, and suggested a rapid radiation
• Massive gene tree heterogeneity consistent with incomplete lineage sorting
• Standard coalescent-based species tree estimation methods contradicted concatenation analysis and prior studies

Page 10

1KP: Thousand Transcriptome Project

• 103 plant transcriptomes, 400-800 single copy “genes”
• Next phase will be much bigger
• Wickett, Mirarab et al., PNAS 2014

G. Ka-Shu Wong, U Alberta

N. Wickett, Northwestern

J. Leebens-Mack, U Georgia

N. Matasci, iPlant

T. Warnow, S. Mirarab, N. Nguyen, UT-Austin

Challenges:
• Massive gene tree heterogeneity consistent with ILS
• Could not use existing coalescent methods due to missing data (many gene trees could not be rooted) and large number of species

Page 11

This talk

• Gene tree heterogeneity due to incomplete lineage sorting, modelled by the multi-species coalescent (MSC)
• Statistically consistent estimation of species trees under the MSC, and the impact of gene tree estimation error
• New methods in phylogenomics:
  • Statistical binning (Science 2014) and Weighted Statistical Binning (PLOS One 2015): improving gene trees
  • ASTRAL (Bioinformatics 2014, 2015): quartet-based estimation
• Open questions

Page 12

Orangutan Gorilla Chimpanzee Human

From the Tree of the Life Website, University of Arizona

Sampling multiple genes from multiple species

Page 13

A species tree defines a probability distribution on gene trees under the Multi-Species Coalescent (MSC) Model

Present

Past

Courtesy James Degnan

Gorilla and Orangutan are not siblings in the species tree, but they are in the gene tree.

Page 14

Statistical Consistency

[Figure: plot of error against amount of data]

Page 15

Main competing approaches

[Figure: alignments for gene 1, gene 2, . . . , gene k over the same species, handled in one of two ways]

• Concatenation
• Analyze separately, then combine with a Summary Method
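The concatenation branch of the slide can be illustrated with a small sketch. It only shows how per-gene alignments are joined into one supermatrix before a single tree is estimated; the function name `concatenate`, the toy sequences, and the choice to pad missing taxa with gaps are assumptions made for illustration, not part of the talk.

```python
def concatenate(alignments, taxa):
    """Join per-gene alignments (dicts of taxon -> aligned sequence) into one supermatrix.

    A taxon missing from a gene is padded with gap characters (an assumed convention).
    """
    columns = {taxon: [] for taxon in taxa}
    for alignment in alignments:
        # All sequences within one gene alignment have the same length.
        length = len(next(iter(alignment.values())))
        for taxon in taxa:
            columns[taxon].append(alignment.get(taxon, "-" * length))
    return {taxon: "".join(parts) for taxon, parts in columns.items()}


# Hypothetical toy data: two genes, the second one lacks Chimp.
genes = [
    {"Human": "ACTG", "Chimp": "ACTC", "Gorilla": "AATC"},
    {"Human": "GG-A", "Gorilla": "GGTA"},
]
supermatrix = concatenate(genes, ["Human", "Chimp", "Gorilla"])
```

A summary method would instead estimate one tree per entry of `genes` and then combine those trees.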

Page 16

Page 17

Statistically consistent under MSC?

• CA-ML (Concatenation using unpartitioned maximum likelihood) - NO

• Most frequent gene tree - NO

• Minimize Deep Coalescences (MDC) - NO

• Greedy Consensus (GC) - NO

• Matrix Representation with Parsimony (MRP, supertree method) - NO

Hence, none of these standard approaches are proven to converge to the true species tree as the number of loci increases.

Many of them are positively misleading (will converge to the wrong tree)!

Page 18

Anomaly zone

• An anomalous gene tree (AGT) is one that is more probable than the true species tree under the multi-species coalescent model.

• Theorem (Degnan 2013, Rosenberg 2013): For n>3, there are model species trees with rooted AGTs, and for n>4 there are model species trees with unrooted AGTs.

Page 19

Anomaly zone

• An anomalous gene tree (AGT) is one that is more probable than the true species tree under the multi-species coalescent model.

• Theorem (Hudson 1983): There are no rooted 3-leaf AGTs.

• Theorem (Allman et al. 2011, Degnan 2013): There are no unrooted 4-leaf AGTs.
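The three-leaf case can be checked numerically. The sketch below uses the standard closed form for rooted triplets under the MSC: for a species tree ((A,B),C) whose internal branch has length t in coalescent units, the matching gene tree has probability 1 - (2/3)e^(-t) and each of the two alternatives has probability (1/3)e^(-t). The function name and the string labels are illustrative assumptions.

```python
import math


def rooted_triplet_probabilities(t):
    """Gene tree probabilities under the MSC for species tree ((A,B),C).

    t is the internal branch length in coalescent units (t >= 0).
    """
    other = math.exp(-t) / 3.0
    return {"AB|C": 1.0 - 2.0 * other, "AC|B": other, "BC|A": other}


for t in (0.0, 0.1, 1.0):
    probs = rooted_triplet_probabilities(t)
    # The topology matching the species tree is never less probable than the others.
    assert probs["AB|C"] >= probs["AC|B"]
```

Because the matching topology is at least as probable as either alternative for every t >= 0, the most frequent rooted triplet is a sensible estimate of the species triplet, which is what the triplet-based summary methods rely on.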

Page 20

Summary Methods

Page 21

Summary Methods

Computing rooted species tree from rooted gene trees:
• For every three species {a,b,c},
  • record most frequent rooted gene tree on {a,b,c}
• Combine rooted three-leaf gene trees into rooted tree if they are compatible

Theorem: This algorithm is statistically consistent under the MSC and runs in polynomial time.
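The first step of this algorithm can be sketched as follows. The code assumes rooted gene trees are given as nested tuples of leaf names and that every gene tree contains all taxa; both are simplifications chosen for illustration, and the function names are hypothetical. The combination step (building one rooted tree from compatible triplets) is not shown.

```python
from collections import Counter
from itertools import combinations


def leaves(tree):
    """Leaf set of a rooted tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Leaf sets of all internal nodes of the tree."""
    if isinstance(tree, str):
        return []
    found = [leaves(tree)]
    for child in tree:
        found.extend(clades(child))
    return found


def triplet_topology(tree_clades, a, b, c):
    """Return the sibling pair induced on {a, b, c}, or None if unresolved."""
    for x, y, z in ((a, b, c), (a, c, b), (b, c, a)):
        if any(x in cl and y in cl and z not in cl for cl in tree_clades):
            return frozenset([x, y])
    return None


def dominant_triplets(gene_trees):
    """Map each set of three taxa to its most frequent sibling pair."""
    taxa = sorted(leaves(gene_trees[0]))
    per_tree_clades = [clades(tree) for tree in gene_trees]
    result = {}
    for trio in combinations(taxa, 3):
        counts = Counter()
        for tree_clades in per_tree_clades:
            pair = triplet_topology(tree_clades, *trio)
            if pair is not None:
                counts[pair] += 1
        result[trio] = counts.most_common(1)[0][0] if counts else None
    return result


# Hypothetical gene trees: two agree with (((H,C),G),O), one groups H with G.
gene_trees = [
    ((("H", "C"), "G"), "O"),
    ((("H", "C"), "G"), "O"),
    ((("H", "G"), "C"), "O"),
]
dominant = dominant_triplets(gene_trees)
```

Here `dominant[("C", "G", "H")]` is the pair {C, H}, since two of the three gene trees place Chimp and Human together.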

Page 22

Summary Methods

Computing unrooted species tree from unrooted gene trees:
• For every four species {a,b,c,d},
  • record most frequent unrooted gene tree on {a,b,c,d}
• Combine unrooted four-leaf gene trees into unrooted tree if they are compatible (recursive algorithm based on finding sibling pairs and removing one sibling)

Theorem: This algorithm is statistically consistent under the MSC and runs in polynomial time.
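The quartet-counting step can be sketched in the same style. The code assumes each unrooted gene tree is supplied as nested tuples with an arbitrary rooting, which is harmless because only the bipartitions are used, and that every gene tree contains all taxa; the names are hypothetical. The recursive combination step from the slide is not shown.

```python
from collections import Counter
from itertools import combinations


def leaves(tree):
    """Leaf set of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Leaf sets below every internal node; each one is a side of a bipartition."""
    if isinstance(tree, str):
        return []
    found = [leaves(tree)]
    for child in tree:
        found.extend(clades(child))
    return found


def quartet_topology(tree_clades, quartet):
    """Return the induced split, e.g. {{a, b}, {c, d}}, or None if unresolved."""
    four = frozenset(quartet)
    for clade in tree_clades:
        side = clade & four
        if len(side) == 2:
            return frozenset([side, four - side])
    return None


def dominant_quartets(gene_trees):
    """Map each set of four taxa to its most frequent unrooted quartet tree."""
    taxa = sorted(leaves(gene_trees[0]))
    per_tree_clades = [clades(tree) for tree in gene_trees]
    result = {}
    for quartet in combinations(taxa, 4):
        counts = Counter()
        for tree_clades in per_tree_clades:
            topology = quartet_topology(tree_clades, quartet)
            if topology is not None:
                counts[topology] += 1
        result[quartet] = counts.most_common(1)[0][0] if counts else None
    return result


# Hypothetical gene trees on five taxa (M is an extra taxon used for illustration).
gene_trees = [
    ((("H", "C"), "G"), ("O", "M")),
    ((("H", "C"), "G"), ("O", "M")),
    ((("H", "G"), "C"), ("O", "M")),
]
dominant = dominant_quartets(gene_trees)
```

For the taxa C, G, H, O the dominant quartet is CH|GO, supported by two of the three gene trees.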

Page 23

Statistically consistent under ILS?

• Coalescent-based summary methods:
  – MP-EST (Liu et al. 2010): maximum pseudo-likelihood estimation of rooted species tree based on rooted triplet tree distribution - YES
  – BUCKy-pop (Ané and Larget 2010): quartet-based Bayesian species tree estimation - YES
  – And many others (ASTRAL, ASTRID, NJst, GLASS, etc.) - YES

• Co-estimation methods: *BEAST (Heled and Drummond 2009): Bayesian co-estimation of gene trees and species trees - YES

Co-estimation methods are too slow to use on most datasets… hence the debate is largely between concatenation (traditional approach) and summary methods.

• Single-site methods (SMRT, SVDquartets, METAL, SNAPP, and others) - YES

• CA-ML (Concatenation using unpartitioned maximum likelihood) - NO

• MDC - NO

• GC (Greedy Consensus) - NO

• MRP (supertree method) - NO

Page 26

Results on 11-taxon datasets with weak ILS

*BEAST more accurate than summary methods (MP-EST, BUCKy, etc). CA-ML (concatenated analysis) most accurate.

Datasets from Chung and Ané, 2011. Bayzid & Warnow, Bioinformatics 2013.

[Figure: average FN rate (y-axis, 0 to 0.25) for 5, 10, 25, and 50 genes; methods shown: *BEAST, CA-ML, BUCKy-con, BUCKy-pop, MP-EST, Phylo-exact, MRP, GC]

Page 27

Results on 11-taxon datasets with weak ILS

[Figure repeated from page 26: average FN rate by number of genes and method]

*BEAST MORE ACCURATE than summary methods, because *BEAST gets more accurate gene trees!

Page 28

Results on 11-taxon datasets with weak ILS

[Figure repeated from page 26: average FN rate by number of genes and method]

Summary methods (BUCKy-pop, MP-EST) are both statistically consistent under the MSC but are impacted by gene tree estimation error

Page 29

Results on 11-taxon datasets with weak ILS

[Figure repeated from page 26: average FN rate by number of genes and method]

Concatenation (RAxML) best of all methods on these data! (However, for high enough ILS, concatenation is not as accurate as the best summary methods.)

Page 30

Impact of Gene Tree Estimation Error on MP-EST

MP-EST has no error on true gene trees, but MP-EST has 9% error on estimated gene trees

Datasets: 11-taxon strong ILS conditions with 50 genes. Similar results for other summary methods (MDC, Greedy, etc.)

[Figure: average FN rate (y-axis, 0 to 0.25) of MP-EST on true versus estimated gene trees]
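The plots report an average FN rate. A common reading, assumed here, is the false negative (missing branch) rate: the fraction of internal branches of the true tree that are absent from the estimated tree, with branches compared as bipartitions of the taxon set. The sketch below computes that quantity for trees given as nested tuples; the representation and function names are illustrative assumptions.

```python
def leaves(tree):
    """Leaf set of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Leaf sets below every internal node."""
    if isinstance(tree, str):
        return []
    found = [leaves(tree)]
    for child in tree:
        found.extend(clades(child))
    return found


def bipartitions(tree):
    """Non-trivial bipartitions (both sides have at least two leaves)."""
    everything = leaves(tree)
    splits = set()
    for clade in clades(tree):
        other = everything - clade
        if len(clade) >= 2 and len(other) >= 2:
            splits.add(frozenset([clade, other]))
    return splits


def fn_rate(true_tree, estimated_tree):
    """Fraction of true-tree bipartitions missing from the estimated tree."""
    true_splits = bipartitions(true_tree)
    if not true_splits:
        return 0.0
    missing = true_splits - bipartitions(estimated_tree)
    return len(missing) / len(true_splits)


# Hypothetical five-taxon example: one of the two true internal branches is missed.
true_tree = ((("H", "C"), "G"), ("O", "M"))
estimated_tree = ((("H", "G"), "C"), ("O", "M"))
rate = fn_rate(true_tree, estimated_tree)
```

In this toy case `rate` is 0.5, because the estimated tree recovers the split HCG|OM but not HC|GOM.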

Page 31

• Summary methods combine estimated gene trees, not true gene trees.

• Multiple studies show that summary methods can be less accurate than concatenation in the presence of high gene tree estimation error.

• Genome-scale data includes a range of markers, not all of which have substantial signal. Furthermore, removing sites due to model violations reduces signal.

• Some researchers also argue that “gene trees” should be based on very short alignments, to avoid intra-locus recombination.

TYPICAL PHYLOGENOMICS PROBLEM: many poor gene trees

Page 32

• Summary methods combine estimated gene trees, not true gene trees.

• Multiple studies show that summary methods can be less accurate than concatenation in the presence of high gene tree estimation error.

• Genome-scale data includes a range of markers, not all of which have substantial signal. Furthermore, removing sites due to model violations reduces signal.

• Some researchers also argue that “gene trees” should be based on very short alignments, to avoid intra-locus recombination.

Gene tree estimation error: key issue in the debate

Page 33

What is the impact of gene tree estimation error on species tree estimation?

• Question: Do any summary methods converge to the species tree as the number of loci increases, but where each locus has only a constant number of sites?

• Answers: Roch & Warnow, Syst Biol, March 2015:
  – Strict molecular clock: Yes for some new methods, even for a single site per locus
  – No clock: Unknown for all methods, including MP-EST, ASTRAL, etc.

S. Roch and T. Warnow. "On the robustness to gene tree estimation error (or lack thereof) of coalescent-based species tree methods", Systematic Biology, 64(4):663-676, 2015

Page 34

Avian Phylogenomics Project

Erich Jarvis, HHMI

Guojie Zhang, BGI

MTP Gilbert, Copenhagen

Siavash Mirarab, Texas

Tandy Warnow, Texas and UIUC

• Approx. 50 species, whole genomes
• 14,000 loci
• Multi-national team (100+ investigators)
• 8 papers published in special issue of Science 2014

Biggest computational challenges:
1. Multi-million site maximum likelihood analysis (~300 CPU years, and 1Tb of distributed memory, at supercomputers around world)
2. Constructing “coalescent-based” species tree from 14,000 different gene trees

Page 35

RESEARCH ARTICLE

Whole-genome analyses resolve early branches in the tree of life of modern birds

Erich D. Jarvis,1*† Siavash Mirarab,2* Andre J. Aberer,3 Bo Li,4,5,6 Peter Houde,7

Cai Li,4,6 Simon Y. W. Ho,8 Brant C. Faircloth,9,10 Benoit Nabholz,11

Jason T. Howard,1 Alexander Suh,12 Claudia C. Weber,12 Rute R. da Fonseca,6

Jianwen Li,4 Fang Zhang,4 Hui Li,4 Long Zhou,4 Nitish Narula,7,13 Liang Liu,14

Ganesh Ganapathy,1 Bastien Boussau,15 Md. Shamsuzzoha Bayzid,2

Volodymyr Zavidovych,1 Sankar Subramanian,16 Toni Gabaldón,17,18,19

Salvador Capella-Gutiérrez,17,18 Jaime Huerta-Cepas,17,18 Bhanu Rekepalli,20

Kasper Munch,21 Mikkel Schierup,21 Bent Lindow,6 Wesley C. Warren,22

David Ray,23,24,25 Richard E. Green,26 Michael W. Bruford,27 Xiangjiang Zhan,27,28

Andrew Dixon,29 Shengbin Li,30 Ning Li,31 Yinhua Huang,31

Elizabeth P. Derryberry,32,33 Mads Frost Bertelsen,34 Frederick H. Sheldon,33

Robb T. Brumfield,33 Claudio V. Mello,35,36 Peter V. Lovell,35 Morgan Wirthlin,35

Maria Paula Cruz Schneider,36,37 Francisco Prosdocimi,36,38 José Alfredo Samaniego,6

Amhed Missael Vargas Velazquez,6 Alonzo Alfaro-Núñez,6 Paula F. Campos,6

Bent Petersen,39 Thomas Sicheritz-Ponten,39 An Pas,40 Tom Bailey,41 Paul Scofield,42

Michael Bunce,43 David M. Lambert,16 Qi Zhou,44 Polina Perelman,45,46

Amy C. Driskell,47 Beth Shapiro,26 Zijun Xiong,4 Yongli Zeng,4 Shiping Liu,4

Zhenyu Li,4 Binghang Liu,4 Kui Wu,4 Jin Xiao,4 Xiong Yinqi,4 Qiuemei Zheng,4

Yong Zhang,4 Huanming Yang,48 Jian Wang,48 Linnea Smeds,12 Frank E. Rheindt,49

Michael Braun,50 Jon Fjeldsa,51 Ludovic Orlando,6 F. Keith Barker,52

Knud Andreas Jønsson,51,53,54 Warren Johnson,55 Klaus-Peter Koepfli,56

Stephen O’Brien,57,58 David Haussler,59 Oliver A. Ryder,60 Carsten Rahbek,51,54

Eske Willerslev,6 Gary R. Graves,51,61 Travis C. Glenn,62 John McCormack,63

Dave Burt,64 Hans Ellegren,12 Per Alström,65,66 Scott V. Edwards,67

Alexandros Stamatakis,3,68 David P. Mindell,69 Joel Cracraft,70 Edward L. Braun,71

Tandy Warnow,2,72† Wang Jun,48,73,74,75,76† M. Thomas P. Gilbert,6,43† Guojie Zhang4,77†

To better determine the history of modern birds, we performed a genome-scale phylogenetic analysis of 48 species representing all orders of Neoaves using phylogenomic methods created to handle genome-scale data. We recovered a highly resolved tree that confirms previously controversial sister or close relationships. We identified the first divergence in Neoaves, two groups we named Passerea and Columbea, representing independent lineages of diverse and convergently evolved land and water bird species. Among Passerea, we infer the common ancestor of core landbirds to have been an apex predator and confirm independent gains of vocal learning. Among Columbea, we identify pigeons and flamingoes as belonging to sister clades. Even with whole genomes, some of the earliest branches in Neoaves proved challenging to resolve, which was best explained by massive protein-coding sequence convergence and high levels of incomplete lineage sorting that occurred during a rapid radiation after the Cretaceous-Paleogene mass extinction event about 66 million years ago.

The diversification of species is not always gradual but can occur in rapid radiations, especially after major environmental changes (1, 2). Paleobiological (3–7) and molecular (8) evidence suggests that such “big bang” radiations occurred for neoavian birds (e.g., songbirds, parrots, pigeons, and others) and placental mammals, representing 95% of extant avian and mammalian species, after the Cretaceous to Paleogene (K-Pg) mass extinction event about 66 million years ago (Ma). However, other nuclear (9–12) and mitochondrial (13, 14) DNA studies propose an earlier, more gradual diversification, beginning within the Cretaceous 80 to 125 Ma. This debate is confounded by findings that different data sets (15–19) and analytical methods (20, 21) often yield contrasting species trees. Resolving such timing and phylogenetic relationships is important for comparative genomics, which can inform about human traits and diseases (22).

Recent avian studies based on fragments of 5 [~5000 base pairs (bp) (8)] and 19 [31,000 bp (17)] genes recovered some relationships inferred from morphological data (15, 23) and DNA-DNA hybridization (24), postulated new relationships, and contradicted many others. Consistent with most previous molecular and contemporary morphological studies (15), they divided modern birds (Neornithes) into Palaeognathae (tinamous and flightless ratites), Galloanseres [Galliformes (landfowl) and Anseriformes (waterfowl)], and Neoaves (all other extant birds). Within Neoaves,

1320 12 DECEMBER 2014 • VOL 346 ISSUE 6215 sciencemag.org SCIENCE

A FLOCK OF GENOMES

Jarvis, Mirarab, et al., examined 48 bird species using 14,000 loci from whole genomes. Two trees were presented.

1. A single dataset maximum likelihood concatenation analysis used ~300 CPU years and 1Tb of distributed memory, using TACC and other supercomputers around the world.

2. However, every locus had a different tree (suggestive of “incomplete lineage sorting”), and the noisy genome-scale data required the development of a new method, “statistical binning”.

Only 48 species, but heuristic ML took ~300 CPU years on multiple supercomputers and used 1Tb of memory

Page 36

12 DECEMBER 2014 • VOL 346 ISSUE 6215 1337 SCIENCE sciencemag.org

INTRODUCTION: Reconstructing species trees for rapid radiations, as in the early diversification of birds, is complicated by biological processes such as incomplete lineage sorting (ILS) that can cause different parts of the genome to have different evolutionary histories. Statistical methods, based on the multispecies coalescent model and that combine gene trees, can be highly accurate even in the presence of massive ILS; however, these methods can produce species trees that are topologically far from the species tree when estimated gene trees have error. We have developed a statistical binning technique to address gene tree estimation error and have explored its use in genome-scale species tree estimation with MP-EST, a popular coalescent-based species tree estimation method.

Statistical binning enables an accurate coalescent-based estimation of the avian tree

AVIAN GENOMICS

Siavash Mirarab, Md. Shamsuzzoha Bayzid, Bastien Boussau, Tandy Warnow*

RESEARCH ARTICLE SUMMARY

The statistical binning pipeline for estimating species trees from gene trees. Loci are grouped into bins based on a statistical test for combinability, before estimating gene trees.

[Figure: Statistical binning technique. Panels: traditional pipeline (unbinned) and statistical binning pipeline. Labels: sequence data, gene alignments, estimated gene trees, incompatibility graph, binned supergene alignments, supergene trees, species tree.]

RATIONALE: In statistical binning, phylogenetic trees on different genes are estimated and then placed into bins, so that the differences between trees in the same bin can be explained by estimation error (see the figure). A new tree is then estimated for each bin by applying maximum likelihood to a concatenated alignment of the multiple sequence alignments of its genes, and a species tree is estimated using a coalescent-based species tree method from these supergene trees.
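The binning step can be pictured as coloring an incompatibility graph: genes are vertices, an edge joins two genes whose trees are judged incompatible, and each bin is a set of mutually compatible genes. The sketch below assumes the incompatible pairs have already been determined by some statistical test, and it uses a simple greedy rule (most conflicted genes first, each placed in the smallest bin it fits). That rule, the function name, and the gene labels are illustrative assumptions, not the published procedure.

```python
def bin_genes(genes, incompatible_pairs):
    """Greedily group genes into bins that contain no incompatible pair."""
    conflicts = {gene: set() for gene in genes}
    for a, b in incompatible_pairs:
        conflicts[a].add(b)
        conflicts[b].add(a)

    bins = []
    # Place the most conflicted genes first; ties are broken by name for determinism.
    for gene in sorted(genes, key=lambda g: (-len(conflicts[g]), g)):
        candidates = [b for b in bins if not (conflicts[gene] & b)]
        if candidates:
            # Prefer the smallest compatible bin to keep bin sizes balanced.
            min(candidates, key=len).add(gene)
        else:
            bins.append({gene})
    return bins


# Hypothetical input: four genes, of which only g1 and g2 are judged incompatible.
genes = ["g1", "g2", "g3", "g4"]
bins = bin_genes(genes, [("g1", "g2")])
```

Each resulting bin would then be concatenated into a supergene alignment, and one supergene tree would be estimated per bin, as the rationale describes.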

RESULTS: Under realistic conditions in our simulation study, statistical binning reduced the topological error of species trees estimated using MP-EST and enabled a coalescent-based analysis that was more accurate than concatenation even when gene tree estimation error was relatively high. Statistical binning also reduced the error in gene tree topology and species tree branch length estimation, especially when the phylogenetic signal in gene sequence alignments was low. Species trees estimated using MP-EST with statistical binning on four biological data sets showed increased concordance with the biological literature. When MP-EST was used to analyze 14,446 gene trees in the avian phylogenomics project, it produced a species tree that was discordant with the concatenation analysis and conflicted with prior literature. However, the statistical binning analysis produced a tree that was highly congruent with the concatenation analysis and was consistent with the prior scientific literature.

CONCLUSIONS: Statistical binning reduces the error in species tree topology and branch length estimation because it reduces gene tree estimation error. These improvements are greatest when gene trees have reduced bootstrap support, which was the case for the avian phylogenomics project. Because using unbinned gene trees can result in overestimation of ILS, statistical binning may be helpful in providing more accurate estimations of ILS levels in biological data sets. Thus, statistical binning enables highly accurate species tree estimations, even on genome-scale data sets.

The list of author affiliations is available in the full article online.

*Corresponding author. E-mail: [email protected]

Cite this article as S. Mirarab et al., Science 346, 1250463 (2014). DOI: 10.1126/science.1250463

ON OUR WEB SITE: Read the full article at http://dx.doi.org/10.1126/science.1250463

Published by AAAS


We used 100 CPU years (mostly on TACC) to develop and test this method.

Page 37

Ideas behind statistical binning

[Figure: x-axis "Number of sites in an alignment"]

•  "Gene tree" error tends to decrease with the number of sites in the alignment

•  Concatenation (even if not statistically consistent) tends to be reasonably accurate when there is not too much gene tree heterogeneity

Page 38

SCIENCE sciencemag.org • 12 DECEMBER 2014 • VOL 346 ISSUE 6215 • p. 1337

RESEARCH ARTICLE SUMMARY

AVIAN GENOMICS

Statistical binning enables an accurate coalescent-based estimation of the avian tree

Siavash Mirarab, Md. Shamsuzzoha Bayzid, Bastien Boussau, Tandy Warnow*

INTRODUCTION: Reconstructing species trees for rapid radiations, as in the early diversification of birds, is complicated by biological processes such as incomplete lineage sorting (ILS) that can cause different parts of the genome to have different evolutionary histories. Statistical methods, based on the multispecies coalescent model and that combine gene trees, can be highly accurate even in the presence of massive ILS; however, these methods can produce species trees that are topologically far from the species tree when estimated gene trees have error. We have developed a statistical binning technique to address gene tree estimation error and have explored its use in genome-scale species tree estimation with MP-EST, a popular coalescent-based species tree estimation method.

[Figure: Statistical binning technique, comparing the traditional pipeline (unbinned) with the statistical binning pipeline. Labels: sequence data, gene alignments, estimated gene trees, incompatibility graph, binned supergene alignments, supergene trees, species tree.]

The statistical binning pipeline for estimating species trees from gene trees. Loci are grouped into bins based on a statistical test for combinability, before estimating gene trees.



Note: Supergene trees were computed using fully partitioned maximum likelihood. Vertex coloring a graph with balanced color classes is NP-hard; we used a heuristic.
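The binning step can be viewed as coloring the incompatibility graph so that the color classes (bins) have similar sizes. The following is a minimal sketch of one simple greedy heuristic for that task, not necessarily the heuristic used in the study. The function name `balanced_greedy_coloring`, the gene labels, and the conflict graph are hypothetical and chosen only for illustration.

```python
def balanced_greedy_coloring(conflicts):
    """Greedy balanced coloring (illustrative sketch, not the published heuristic).

    conflicts: dict mapping each gene to the set of genes it is incompatible with.
    Returns a list of bins, each a set of mutually compatible genes.
    """
    bins = []
    # Handle the most constrained genes first (largest number of conflicts).
    order = sorted(conflicts, key=lambda g: (-len(conflicts[g]), g))
    for gene in order:
        # Bins that contain no gene in conflict with this one.
        candidates = [b for b in bins if not (b & conflicts[gene])]
        if candidates:
            # Prefer the smallest compatible bin to keep bin sizes balanced.
            min(candidates, key=len).add(gene)
        else:
            bins.append({gene})
    return bins


# Hypothetical incompatibility graph on five genes: g1-g2 and g2-g3 conflict.
conflicts = {
    "g1": {"g2"},
    "g2": {"g1", "g3"},
    "g3": {"g2"},
    "g4": set(),
    "g5": set(),
}
bins = balanced_greedy_coloring(conflicts)
```

Because the problem is NP-hard, a greedy pass like this gives no optimality guarantee; it only ensures that no bin contains two incompatible genes.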

Page 39

Statistical binning vs. unbinned

Datasets: 11-taxon strong ILS datasets with 50 genes from Chung and Ané, Systematic Biology

Binning produces bins with approximately 5 to 7 genes each

[Figure: bar chart of average FN rate (y-axis, 0 to 0.25) for MP-EST, MDC*(75), MRP, MRL, and GC; series: Unbinned and Statistical-75.]

Page 40

Theorem 3 (PLOS One, Bayzid et al. 2015): Unweighted statistical binning pipelines are not statistically consistent under GTR+MSC

As the number of sites per locus increases:
•  All estimated gene trees converge to the true gene tree and have bootstrap support that converges to 1 (Steel 2014)
•  For each bin, with probability converging to 1, the genes in the bin have the same tree topology (but can have different numeric parameters), and there is only one bin for any given tree topology
•  For each bin, a fully partitioned maximum likelihood (ML) analysis of its supergene alignment converges to a tree with the common gene tree topology.

As the number of loci increases:
•  Every gene tree topology appears with probability converging to 1.

Hence as both the number of loci and number of sites per locus increase, with probability converging to 1, every gene tree topology appears exactly once in the set of supergene trees. It is impossible to infer the species tree from the flat distribution of gene trees!

Page 41

Page 42

Theorem 2 (PLOS One, Bayzid et al. 2015): Weighted statistical binning (WSB) pipelines are statistically consistent under GTR+MSC

Easy proof: As the number of sites per locus increases:
•  All estimated gene trees converge to the true gene tree and have bootstrap support that converges to 1 (Steel 2014)
•  For every bin, with probability converging to 1, the genes in the bin have the same tree topology
•  Fully partitioned GTR ML analysis of each bin converges to a tree with the common topology of the genes in the bin

Hence as the number of sites per locus and number of loci both increase, WSB followed by a statistically consistent summary method will converge in probability to the true species tree. Q.E.D.
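The contrast between the unweighted and weighted pipelines can be illustrated with a small sketch. It assumes, following the weighted variant described in Bayzid et al. 2015, that each supergene tree is passed to the summary method once per gene in its bin. The function name `supergene_tree_multiset`, the topology labels, and the bin sizes are hypothetical.

```python
from collections import Counter


def supergene_tree_multiset(bins, weighted):
    """Build the multiset of supergene trees passed to the summary method.

    bins: dict mapping a topology label to the number of genes in that bin.
    weighted: if True, repeat each supergene tree once per gene in its bin
              (assumed weighting scheme); if False, use each tree once.
    """
    trees = Counter()
    for topology, n_genes in bins.items():
        trees[topology] = n_genes if weighted else 1
    return trees


# Hypothetical limit case: one bin per topology, with sizes that follow
# the gene tree distribution.
bins = {"((A,B),C)": 70, "((A,C),B)": 15, "((B,C),A)": 15}
unweighted = supergene_tree_multiset(bins, weighted=False)  # flat: 1, 1, 1
weighted = supergene_tree_multiset(bins, weighted=True)     # 70, 15, 15
```

In the unweighted case every topology appears once, so the frequency information a summary method relies on is lost; the weighted case restores it.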

Page 43

Weighted Statistical Binning: empirical

WSB generally benign to highly beneficial:

•  Improves accuracy of gene tree topology

•  Improves accuracy of species tree topology

•  Improves accuracy of species tree branch length

•  Reduces incidence of highly supported false positive branches

Page 44

Statistical binning vs. Unbinned and Concatenation

[Figure panels: (a) MP-EST on varying gene sequence length; (b) ASTRAL on varying gene sequence length; (c) MP-EST on varying numbers of genes; (d) MP-EST on varying levels of ILS.]

Species tree estimation error for MP-EST and ASTRAL, and also concatenation using ML, on avian simulated datasets: 48 taxa, moderately high ILS (AD=47%), 1000 genes, and varying gene sequence length.

Bayzid et al. (2015). PLoS ONE 10(6): e0129183

Page 45

Comparing Binned and Un-binned MP-EST on the Avian Dataset

[Figure: two avian species trees with branch support values, "Binned MP-EST (unweighted/weighted)" and "Unbinned MP-EST". Labeled clades include Cursores, Columbea, Otidimorphae, and Australaves; a marker indicates "Conflict with other lines of strong evidence".]

Unbinned MP-EST strongly rejects Columbea, a major finding by Jarvis, Mirarab, et al. Binned MP-EST is largely consistent with the ML concatenation analysis. The trees presented in Science 2014 were the ML concatenation and Binned MP-EST.

Page 46

Running Time Comparison

•  Concatenation analysis of the Avian dataset:
   –  ~250 CPU years and 1 Tb memory
•  Statistical binning analysis:
   –  ~5 CPU years, almost all of which was computing maximum likelihood gene trees, much less memory usage

Species tree estimation using traditional approaches is more computationally expensive, and not as accurate as coalescent-based methods!

Page 47

Summary (so far)

•  Statistical binning (weighted or unweighted) improves gene trees, and leads to improved species trees in the presence of ILS compared to unbinned analyses.

•  Statistical binning pipelines are also more accurate than concatenation under high ILS.

•  Pipelines using the weighted version are statistically consistent under the multi-species coalescent model.

•  Statistical binning pipelines are much faster than concatenation analyses (e.g. ~5 CPU years vs. ~250 CPU years for the avian dataset).

Page 48

1KP: Thousand Transcriptome Project

•  103 plant transcriptomes, 400-800 single copy "genes"
•  Wickett, Mirarab et al., PNAS 2014
•  Next phase will be much bigger (~1000 species and ~1000 genes)

G. Ka-Shu Wong (U Alberta)

N. Wickett (Northwestern)

J. Leebens-Mack (U Georgia)

N. Matasci (iPlant)

T. Warnow, S. Mirarab, N. Nguyen (all UT-Austin)

Challenges:
•  Massive gene tree heterogeneity consistent with ILS
•  Could not use existing coalescent methods due to missing data (many gene trees could not be rooted) and large number of species

Page 49

1KP: Thousand Transcriptome Project

Solution:
•  New coalescent-based method ASTRAL (Mirarab et al., ECCB/Bioinformatics 2014; Mirarab et al., ISMB/Bioinformatics 2015)

•  ASTRAL is statistically consistent, polynomial time, and uses unrooted gene trees.

Page 50

ASTRAL [Mirarab et al., ECCB/Bioinformatics, 2014]

•  Optimization Problem (NP-Hard): Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees

$$\mathrm{Score}(T) = \sum_{t \in \mathcal{T}} |Q(T) \cap Q(t)|$$

where $Q(T)$ is the set of quartet trees induced by $T$, $t$ is a gene tree, and $\mathcal{T}$ is the set of all input gene trees.

•  Theorem: Statistically consistent under the multi-species coalescent model when solved exactly
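The quartet score can be made concrete with a brute-force sketch. This is an illustration of the definition only, not ASTRAL's algorithm: it enumerates every induced quartet explicitly, which is feasible only for a handful of taxa. Trees are represented by their non-trivial bipartitions (one side of each split), and the names `quartets`, `quartet_score`, and the example trees are assumptions introduced here.

```python
from itertools import combinations


def quartets(splits, taxa):
    """Return the set of resolved quartet trees induced by an unrooted tree.

    splits: iterable of sets, each one side of a bipartition of `taxa`.
    A quartet ab|cd is stored as a frozenset of two frozensets.
    """
    taxa = set(taxa)
    result = set()
    for side in splits:
        side = set(side) & taxa
        other = taxa - side
        for pair1 in combinations(sorted(side), 2):
            for pair2 in combinations(sorted(other), 2):
                result.add(frozenset([frozenset(pair1), frozenset(pair2)]))
    return result


def quartet_score(species_splits, gene_trees, taxa):
    """Sum over gene trees of the number of quartets shared with the species tree."""
    q_species = quartets(species_splits, taxa)
    return sum(len(q_species & quartets(g, taxa)) for g in gene_trees)


taxa = {"A", "B", "C", "D", "E"}
species_tree = [{"A", "B"}, {"D", "E"}]   # ((A,B),C,(D,E))
gene_trees = [
    [{"A", "B"}, {"D", "E"}],             # same topology: shares all 5 quartets
    [{"A", "C"}, {"D", "E"}],             # ((A,C),B,(D,E)): shares 3 quartets
]
score = quartet_score(species_tree, gene_trees, taxa)
```

With five taxa there are five quartets per fully resolved tree, so the first gene tree contributes 5 and the second contributes 3.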

Page 51

Constrained Maximum Quartet Support Tree

•  Input: Set $\mathcal{T} = \{t_1, t_2, \ldots, t_k\}$ of unrooted gene trees, with each tree on set S with n species, and set X of allowed bipartitions

•  Output: Unrooted tree T on leaf set S, maximizing the total quartet tree similarity to $\mathcal{T}$, subject to T drawing its bipartitions from X.

Theorems (Mirarab et al., 2014):
•  If X contains the bipartitions from the input gene trees (and perhaps others), then an exact solution to this problem is statistically consistent under the MSC.

•  The constrained MQST problem can be solved in $O(|X|^2 nk)$ time. (We use dynamic programming, and build the unrooted tree from the bottom up, based on "allowed clades", i.e., halves of the allowed bipartitions.)


Page 54

Simulation study

• Variable parameters:

   • Number of species: 10 – 1000

   • Number of genes: 50 – 1000

   • Amount of ILS: low, medium, high

   • Deep versus recent speciation

• 11 model conditions (50 replicates each) with heterogeneous gene tree error

• Compare to NJst, MP-EST, concatenation (CA-ML)

• Evaluate accuracy using FN rate: the percentage of branches in the true tree that are missing from the estimated tree
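The FN rate can be sketched as a set difference over bipartitions. The snippet below is a minimal illustration under stated assumptions: trees are given as lists of non-trivial bipartitions (one side each), the true tree has at least one internal branch, and the helper names `canonical` and `fn_rate` are hypothetical. It returns a fraction; multiply by 100 for a percentage.

```python
def canonical(side, taxa):
    """Represent a bipartition by the side that excludes a fixed reference taxon."""
    taxa = frozenset(taxa)
    side = frozenset(side)
    return taxa - side if min(taxa) in side else side


def fn_rate(true_splits, estimated_splits, taxa):
    """Fraction of true-tree branches (bipartitions) missing from the estimate."""
    true_set = {canonical(s, taxa) for s in true_splits}
    est_set = {canonical(s, taxa) for s in estimated_splits}
    return len(true_set - est_set) / len(true_set)


taxa = {"A", "B", "C", "D", "E"}
true_tree = [{"A", "B"}, {"D", "E"}]            # ((A,B),C,(D,E))
estimated_tree = [{"C", "D", "E"}, {"C", "D"}]  # ((A,B),E,(C,D))
rate = fn_rate(true_tree, estimated_tree, taxa)
```

Here the split AB|CDE is recovered (written from its other side in the estimate) and DE|ABC is missed, so half of the true branches are missing.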

[Figure: simulation pipeline. Labels: true (model) species tree, true gene trees, sequence data, estimated gene trees, estimated species tree; example taxa: Finch, Falcon, Owl, Eagle, Pigeon.]

ASTRAL-II

look at all pairs of leaves chosen each from one of the children of u. For each such pair of leaves, there are $\binom{u'}{2}$ quartet trees that put that pair together, where $u'$ is the number of leaves outside the node u. This will examine each pair of nodes in each of the input k nodes exactly once and would therefore require $O(n^2 k)$ computations. The final score can be normalized by the maximum number of input quartet trees that include a pair of taxa.

Given the similarity matrix, we calculate an UPGMA tree and add all its bipartitions to the set X. This heuristic adds relatively few bipartitions, but the matrix is used in the next heuristic, which is our main addition mechanism.

Greedy: We estimate the greedy consensus of the gene trees at different threshold levels (0, 1/100, 2/100, 5/100, 1/10, 1/4, 1/3). For each polytomy in each greedy consensus tree, we resolve the polytomy in multiple ways and add bipartitions implied by those resolutions to the set X. First, we resolve the polytomy by applying the UPGMA algorithm to the similarity matrix, starting from the clades given by the polytomy. Then, we sample one taxon from each side of the polytomy randomly, and use the greedy consensus of the gene trees restricted to this subsample to find a resolution of the polytomy (we randomly resolve any multifurcations in this greedy consensus on the induced subsample). We repeat this process at least 10 times, but if the subsampled greedy consensus trees include sufficiently frequent bipartitions (defined as > 1%), we do more rounds of random sampling (we increase the number of iterations by two every time this happens). For each random subsample around a polytomy, we also resolve it by calculating an UPGMA tree on the subsampled similarity matrix. Finally, for the two first greedy threshold values and the first 10 random subsamples, we also use a third strategy that can potentially add a larger number of bipartitions. For each subsampled taxon x, we resolve the polytomy as a caterpillar tree by sorting the remaining taxa according to their similarity with x.

Gene tree polytomies: When gene trees include polytomies, we also add new bipartitions to set X. We first compute the greedy consensus of the input gene trees with threshold 0 and if the greedy consensus has polytomies, we resolve them using UPGMA; we repeat this process twice to account for uncertainty in greedy consensus estimation. Then, for each gene tree polytomy, we use the two resolved consensus trees to infer a resolution of the polytomy and we add the implied resolutions to set X.

[Figure 1: density plots. (a) True gene tree discordance: RF distance (true species tree vs true gene trees), 0% to 80%, by speciation rate (1e−06, 1e−07) and tree height (10M, 2M, 500K). (b) Gene tree estimation error: RF distance (true vs estimated), 0% to 100%.]

Fig. 1. Characteristics of the simulation. (a) RF distance between the true species tree and the true gene trees (50 replicates of 1000 genes) for Dataset I. Tree height directly affects the amount of true discordance; the speciation rate affects true gene tree discordance only with 10M tree length. (b) RF distance between true gene trees and estimated gene trees for Dataset I. See also Figure S1 for inter and intra-replicate gene tree error distributions.

3.3 Multi-furcating input gene trees

Extending ASTRAL to inputs that include polytomies requires solving the weighted quartet tree problem when each node of the input defines not a tripartition, but a multi-partition of the set of taxa. We start by a basic observation: every resolved quartet tree induced by a gene tree maps to two nodes in the gene tree regardless of whether the gene tree is binary or not. In other words, induced quartet trees that map to only one node of the gene tree are unresolved. When maximizing the quartet support, these unresolved gene tree quartet trees are inconsequential and need to be ignored. Now, consider a polytomy of degree d. There are $\binom{d}{3}$ ways to select three sides of the polytomy. Each of these ways of selecting three sides defines a tripartition of a subset of taxa. Any selection of two taxa from one side of this tripartition and one taxon from each of the remaining two sides still defines an induced resolved quartet tree, and each induced resolved quartet tree would still map to exactly two nodes in our multi-furcating tree. Thus, all the algorithmic assumptions of ASTRAL remain intact, as long as for each multi-furcating node in an input gene tree, we treat it as a collection of $\binom{d}{3}$ tripartitions. Note that in the presence of polytomies, the running time analysis can change because analyzing each multi-furcating node requires time cubic in its degree and the degree can increase in principle with n. Thus, the running time depends on the patterns of the multi-furcations and cannot be studied in a general case.
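The treatment of a degree-d node as a collection of $\binom{d}{3}$ tripartitions can be sketched directly. This is an illustrative enumeration under stated assumptions, not ASTRAL-II's implementation: the function names `polytomy_tripartitions` and `resolved_quartets_at` and the example node are hypothetical.

```python
from itertools import combinations
from math import comb


def polytomy_tripartitions(sides):
    """Enumerate all ways to pick three sides of a node of degree d.

    sides: list of the d disjoint taxon sets hanging off the node.
    """
    return list(combinations([frozenset(s) for s in sides], 3))


def resolved_quartets_at(tripartition):
    """Count quartets with two taxa on one side and one taxon on each other side."""
    x, y, z = (len(part) for part in tripartition)
    return comb(x, 2) * y * z + comb(y, 2) * x * z + comb(z, 2) * x * y


# Hypothetical node of degree 4 in a gene tree on six taxa.
sides = [{"A", "B"}, {"C"}, {"D"}, {"E", "F"}]
tripartitions = polytomy_tripartitions(sides)
counts = [resolved_quartets_at(t) for t in tripartitions]
```

The enumeration is cubic in the node degree, which is why the running time in the presence of polytomies depends on the pattern of multi-furcations.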

Statistical Consistency: ASTRAL-I was statistically consistent, and changes from ASTRAL-I to ASTRAL-II either affect running time, or enlarge the search space, which does not negate consistency.

Theorem 3: ASTRAL-II is statistically consistent for binary complete input gene trees.

4 EXPERIMENTAL SETUP

Simulation Procedure: We used SimPhy, a tool developed by Mallo et al. (2015), to simulate species trees and gene trees (produced in mutation units), and then used Indelible to simulate sequences down the gene trees with varying length and model parameters. We estimated gene trees on these simulated gene alignments, which we then used in coalescent-based analyses.

We simulated 10 model conditions, which we divide into two datasets, with one model condition appearing in both datasets. We

Used SimPhy, Mallo and Posada, 2015

Page 55

Tree accuracy when varying the number of species

1000 genes, "medium" levels of recent ILS

[Figure: species tree topological error (FN), 4% to 16%, versus number of species (10, 50, 100, 200, 500, 1000) for ASTRAL-II and MP-EST.]


Page 57

Page 58

Accuracy in the presence of HGT + ILS

200 Estimated Gene Trees

Data: Fixed, moderate ILS rate, 50 replicates per HGT rates (1)-(6), 1 model species tree per replicate on 51 taxa, 1000 true gene trees, simulated 1000 bp gene sequences using INDELible [8], 1000 gene trees estimated from GTR simulated sequences using FastTree-2 [7]

[7] Price, Dehal, Arkin 2015
[8] Fletcher, Yang 2009

Davidson et al., RECOMB-CG, BMC Genomics 2015

Page 59

Page 60

Page 61

Summary

•  ASTRAL is a summary method that is statistically consistent in the presence of ILS, and that runs in polynomial time. ASTRAL can analyze very large datasets (1000 species and 1000 genes, or more) with high accuracy.

•  Coalescent-based summary methods are much faster than traditional concatenation approaches, and they can provide improved accuracy in the presence of gene tree heterogeneity.

•  Gene tree estimation error impacts accuracy of species trees, but statistical binning can reduce gene tree estimation error, and lead to improved species tree estimations (topology, branch lengths, and incidence of false positives).

Page 62

Future Directions

•  Better coalescent-based summary methods (that are more robust to gene tree estimation error)

•  Better techniques for estimating gene trees given multi-locus data, or for co-estimating gene trees and species trees

•  Better theory about robustness to gene tree estimation error (or lack thereof) for coalescent-based summary methods

•  Better "single site" methods (see SMRT, SVDquartets, METAL, and SNAPP)

Page 63

The Tree of Life: Multiple Challenges

Scientific challenges:
•  Ultra-large multiple-sequence alignment
•  Alignment-free phylogeny estimation
•  Supertree estimation
•  Estimating species trees from many gene trees
•  Genome rearrangement phylogeny
•  Reticulate evolution
•  Visualization of large trees and alignments
•  Data mining techniques to explore multiple optima
•  Theoretical guarantees under Markov models of evolution

Applications:
•  metagenomics
•  protein structure and function prediction
•  trait evolution
•  detection of co-evolution
•  systems biology

Techniques:
•  Graph theory (especially chordal graphs)
•  Probability theory and statistics
•  Hidden Markov models
•  Combinatorial optimization
•  Heuristics
•  Supercomputing

Page 64

Big Phylogenomic Data

•  Thousands to millions of sequences and species, millions of sites per species
•  Big phylogenomic data are not the same as the usual data
•  Relative performance of methods changes with dataset size and heterogeneity
•  Even moderate-sized inputs can create huge outputs
   –  We need new methods, new optimization problems, new statistical models, new statistical theory, compression methods, visualization methods, ….

Page 65

Acknowledgments

NSF grant DBI-1461364 (joint with Noah Rosenberg at Stanford and Luay Nakhleh at Rice): http://tandy.cs.illinois.edu/PhylogenomicsProject.html

Papers available at http://tandy.cs.illinois.edu/papers.html

Software
ASTRAL and statistical binning: Available at https://github.com/smirarab
Others at http://tandy.cs.illinois.edu/software.html

Other Funding: David Bruton Jr. Centennial Professorship, TACC (Texas Advanced Computing Center), Grainger Foundation, and HHMI (to SM)