tandy warnow the university of illinoistandy.cs.illinois.edu/warnow-imperial-v2.pdf · groups:...
TRANSCRIPT
Genome-scale Estimation of the Tree of Life
Tandy Warnow, The University of Illinois
Orangutan Gorilla Chimpanzee Human
From the Tree of the Life Website, University of Arizona
Phylogeny (evolutionary tree)
Phylogenomics = Species trees from whole genomes
“Nothing in biology makes sense except in the light of evolution” - Dobzhansky
The Tree of Life: Multiple Challenges
Scientific challenges:
• Ultra-large multiple-sequence alignment
• Alignment-free phylogeny estimation
• Supertree estimation
• Estimating species trees from many gene trees
• Genome rearrangement phylogeny
• Reticulate evolution
• Visualization of large trees and alignments
• Data mining techniques to explore multiple optima
• Theoretical guarantees under Markov models of evolution
Applications:
• metagenomics
• protein structure and function prediction
• trait evolution
• detection of co-evolution
• systems biology
Techniques:
• Graph theory (especially chordal graphs)
• Probability theory and statistics
• Hidden Markov models
• Combinatorial optimization
• Heuristics
• Supercomputing
phylogenomics
gene 1, gene 2, ..., gene 999, gene 1000 (one example alignment per line):
ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG
CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G
AGCAGCATCGTG AGCAGC-TCGTG AGCAGC-TC-TG C-TA-CACGGTG
CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT
“gene” here refers to a portion of the genome (not a functional gene)
Orangutan
Gorilla
Chimpanzee
Human
I’ll use the term “gene” to refer to “c-genes”: recombination-free orthologous stretches of the genome
Gene tree discordance
Orang. Gorilla Chimp Human | Orang. Gorilla Chimp Human
gene 1000 | gene 1
Incomplete Lineage Sorting (ILS) is a dominant cause of gene tree heterogeneity
Gene trees inside the species tree (Coalescent Process)
Present
Past
Courtesy James Degnan
Gorilla and Orangutan are not siblings in the species tree, but they are in the gene tree.
Incomplete Lineage Sorting (ILS)
• Confounds phylogenetic analysis for many groups: Hominids, Birds, Yeast, Animals, Toads, Fish, Fungi, etc.
• There is substantial debate about how to analyze phylogenomic datasets in the presence of ILS, focused around statistical consistency guarantees (theory) and performance on data.
Avian Phylogenomics Project
E. Jarvis, HHMI
G. Zhang, BGI
MTP Gilbert, Copenhagen
S. Mirarab, UT-Austin
Md. S. Bayzid, UT-Austin
T. Warnow, UT-Austin
Plus many many other people…
• Approx. 50 species, whole genomes, 14,000 loci
• Jarvis, Mirarab, et al., Science 2014
Major challenges:
• Concatenation analysis took >250 CPU years, and suggested a rapid radiation
• Massive gene tree heterogeneity consistent with incomplete lineage sorting
• Standard coalescent-based species tree estimation methods contradicted concatenation analysis and prior studies
1KP: Thousand Transcriptome Project
• 103 plant transcriptomes, 400-800 single copy “genes”
• Next phase will be much bigger
• Wickett, Mirarab et al., PNAS 2014
G. Ka-Shu Wong, U Alberta
N. Wickett, Northwestern
J. Leebens-Mack, U Georgia
N. Matasci, iPlant
T. Warnow, S. Mirarab, N. Nguyen, UT-Austin
Challenges:
• Massive gene tree heterogeneity consistent with ILS
• Could not use existing coalescent methods due to missing data (many gene trees could not be rooted) and large number of species
This talk
• Gene tree heterogeneity due to incomplete lineage sorting, modelled by the multi-species coalescent (MSC)
• Statistically consistent estimation of species trees under the MSC, and the impact of gene tree estimation error
• New methods in phylogenomics:
  • Statistical binning (Science 2014) and Weighted Statistical Binning (PLOS One 2015): improving gene trees
  • ASTRAL (Bioinformatics 2014, 2015): quartet-based estimation
• Open questions
Sampling multiple genes from multiple species
A species tree defines a probability distribution on gene trees under the Multi-Species Coalescent (MSC) Model
Statistical Consistency
[Figure: error plotted against amount of data]
Main competing approaches
[Diagram: gene 1, gene 2, ..., gene k are either analyzed separately and combined by a summary method, or concatenated, to produce a species tree]
Statistically consistent under MSC?
• CA-ML (Concatenation using unpartitioned maximum likelihood) – NO
• Most frequent gene tree – NO
• Minimize Deep Coalescences (MDC) – NO
• Greedy Consensus (GC) – NO
• Matrix Representation with Parsimony (MRP, supertree method) – NO
Hence, none of these standard approaches are proven to converge to the true species tree as the number of loci increases.
Many of them are positively misleading (will converge to the wrong tree)!
Anomaly zone
• An anomalous gene tree (AGT) is one that is more probable than the true species tree under the multi-species coalescent model.
• Theorem (Degnan 2013, Rosenberg 2013): For n>3, there are model species trees with rooted AGTs, and for n>4 there are model species trees with unrooted AGTs.
• Theorem (Hudson 1983): There are no rooted 3-leaf AGTs.
• Theorem (Allman et al. 2011, Degnan 2013): There are no unrooted 4-leaf AGTs.
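The three-leaf result can be checked numerically. The sketch below assumes the standard MSC formula for three species, which the slides do not state: for a species tree ((A,B),C) whose internal branch has length T in coalescent units, the matching gene tree has probability 1 − (2/3)e^(−T) and each of the two discordant gene trees has probability (1/3)e^(−T). The function name is illustrative only.

```python
import math


def triplet_gene_tree_probs(t):
    """Rooted gene tree probabilities for species tree ((A,B),C) under the MSC.

    t is the internal branch length in coalescent units (assumed t >= 0).
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    # Probability that lineages A and B fail to coalesce on the internal
    # branch, split evenly over the three possible topologies.
    each_discordant = math.exp(-t) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * each_discordant,  # matches the species tree
        "((A,C),B)": each_discordant,
        "((B,C),A)": each_discordant,
    }


if __name__ == "__main__":
    for t in (0.0, 0.1, 1.0, 5.0):
        print(t, triplet_gene_tree_probs(t))
```

For every T ≥ 0 the matching topology is at least as probable as either alternative, so under this formula no rooted 3-leaf gene tree can be anomalous.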
Summary Methods
Computing rooted species tree from rooted gene trees:
• For every three species {a,b,c}, record most frequent rooted gene tree on {a,b,c}
• Combine rooted three-leaf gene trees into rooted tree if they are compatible
Theorem: This algorithm is statistically consistent under the MSC and runs in polynomial time.
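The first step of this algorithm (recording the most frequent rooted triplet) can be sketched as follows. This is a minimal illustration, not the implementation used in the talk: it assumes rooted gene trees are given as nested tuples of leaf-name strings, and all function names are hypothetical. The second step, combining compatible triplets into one tree, is not shown.

```python
from collections import Counter
from itertools import combinations


def clusters(tree):
    """Leaf sets below every internal node of a rooted nested-tuple tree."""
    found = []

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.append(leaves)
        return leaves

    walk(tree)
    return found


def induced_triplet(tree_clusters, a, b, c):
    """Return (sibling pair, outgroup) induced on {a, b, c}, or None if unresolved."""
    for x, y, z in ((a, b, c), (a, c, b), (b, c, a)):
        # xy|z holds when some cluster contains x and y but not z.
        if any(x in cl and y in cl and z not in cl for cl in tree_clusters):
            return (frozenset([x, y]), z)
    return None


def dominant_triplets(gene_trees, species):
    """Most frequent rooted triplet topology for every set of three species."""
    all_clusters = [clusters(t) for t in gene_trees]
    result = {}
    for trio in combinations(sorted(species), 3):
        counts = Counter()
        for cl in all_clusters:
            topology = induced_triplet(cl, *trio)
            if topology is not None:
                counts[topology] += 1
        if counts:
            result[trio] = counts.most_common(1)[0][0]
    return result
```

With three gene trees ((H,C),G),O and one gene tree ((H,G),C),O, the dominant triplet on {C, G, H} is HC|G, matching the majority of the gene trees.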
Computing unrooted species tree from unrooted gene trees:
• For every four species {a,b,c,d}, record most frequent unrooted gene tree on {a,b,c,d}
• Combine unrooted four-leaf gene trees into unrooted tree if they are compatible (recursive algorithm based on finding sibling pairs and removing one sibling)
Theorem: This algorithm is statistically consistent under the MSC and runs in polynomial time.
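The quartet-recording step can be sketched in the same style. This is an assumption-laden illustration rather than the talk's code: unrooted gene trees are represented as nested tuples with an arbitrary rooting (the rooting does not affect the induced quartet), and the function names are hypothetical. The recursive sibling-pair combination step is not shown.

```python
from collections import Counter
from itertools import combinations


def clusters(tree):
    """Leaf sets on one side of every internal edge (nested-tuple tree)."""
    found = []

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.append(leaves)
        return leaves

    walk(tree)
    return found


def induced_quartet(tree_clusters, four):
    """Return the split {{a,b},{c,d}} induced on four leaves, or None if unresolved."""
    four = frozenset(four)
    for cl in tree_clusters:
        inside = cl & four
        # An edge separating exactly two of the four leaves defines the quartet.
        if len(inside) == 2:
            return frozenset([inside, four - inside])
    return None


def dominant_quartets(gene_trees, species):
    """Most frequent unrooted quartet topology for every set of four species."""
    all_clusters = [clusters(t) for t in gene_trees]
    result = {}
    for four in combinations(sorted(species), 4):
        counts = Counter()
        for cl in all_clusters:
            topology = induced_quartet(cl, four)
            if topology is not None:
                counts[topology] += 1
        if counts:
            result[four] = counts.most_common(1)[0][0]
    return result
```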
Statistically consistent under ILS?
• Coalescent-based summary methods:
  – MP-EST (Liu et al. 2010): maximum pseudo-likelihood estimation of rooted species tree based on rooted triplet tree distribution – YES
  – BUCKy-pop (Ané and Larget 2010): quartet-based Bayesian species tree estimation – YES
  – And many others (ASTRAL, ASTRID, NJst, GLASS, etc.) – YES
• Co-estimation methods: *BEAST (Heled and Drummond 2009): Bayesian co-estimation of gene trees and species trees – YES
• Single-site methods (SMRT, SVDquartets, METAL, SNAPP, and others) – YES
• CA-ML (Concatenation using unpartitioned maximum likelihood) – NO
• MDC – NO
• GC (Greedy Consensus) – NO
• MRP (supertree method) – NO
Co-estimation methods are too slow to use on most datasets… hence the debate is largely between concatenation (traditional approach) and summary methods.
Results on 11-taxon datasets with weak ILS
*BEAST more accurate than summary methods (MP-EST, BUCKy, etc.)
CA-ML (concatenated analysis) most accurate
Datasets from Chung and Ané, 2011
Bayzid & Warnow, Bioinformatics 2013
[Figure: average FN rate (y-axis, 0 to 0.25) for 5, 10, 25, and 50 genes; methods: *BEAST, CA-ML, BUCKy-con, BUCKy-pop, MP-EST, Phylo-exact, MRP, GC]
*BEAST MORE ACCURATE than summary methods, because *BEAST gets more accurate gene trees!
Summary methods (BUCKy-pop, MP-EST) are both statistically consistent under the MSC but are impacted by gene tree estimation error
Concatenation (RAxML) best of all methods on these data! (However, for high enough ILS, concatenation is not as accurate as the best summary methods.)
Impact of Gene Tree Estimation Error on MP-EST
MP-EST has no error on true gene trees, but MP-EST has 9% error on estimated gene trees
Datasets: 11-taxon strong ILS conditions with 50 genes
Similar results for other summary methods (MDC, Greedy, etc.)
[Figure: average FN rate of MP-EST (y-axis, 0 to 0.25) on true vs. estimated gene trees]
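The error measure on these plots is the FN (false negative) rate. The slides do not define it; the sketch below assumes the usual definition, the fraction of non-trivial bipartitions of the true tree that are missing from the estimated tree. Trees are nested tuples of leaf names, and the function names are illustrative.

```python
def bipartitions(tree):
    """Non-trivial bipartitions of a tree given as nested tuples of leaf names."""
    found = []

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.append(leaves)
        return leaves

    all_leaves = walk(tree)
    splits = set()
    for side in found:
        other = all_leaves - side
        # Skip trivial splits that separate fewer than two leaves.
        if len(side) >= 2 and len(other) >= 2:
            splits.add(frozenset([side, other]))
    return splits


def fn_rate(true_tree, estimated_tree):
    """Fraction of true-tree bipartitions absent from the estimated tree."""
    true_splits = bipartitions(true_tree)
    if not true_splits:
        return 0.0
    missing = true_splits - bipartitions(estimated_tree)
    return len(missing) / len(true_splits)


if __name__ == "__main__":
    true_tree = (((("A", "B"), "C"), "D"), "E")
    estimate = (((("A", "C"), "B"), "D"), "E")
    print(fn_rate(true_tree, estimate))
```

In the example, the true tree has two non-trivial bipartitions (AB|CDE and ABC|DE) and the estimate recovers only ABC|DE, so one of two is missed.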
TYPICAL PHYLOGENOMICS PROBLEM: many poor gene trees
• Summary methods combine estimated gene trees, not true gene trees.
• Multiple studies show that summary methods can be less accurate than concatenation in the presence of high gene tree estimation error.
• Genome-scale data includes a range of markers, not all of which have substantial signal. Furthermore, removing sites due to model violations reduces signal.
• Some researchers also argue that “gene trees” should be based on very short alignments, to avoid intra-locus recombination.
Gene tree estimation error: key issue in the debate
• Question: Do any summary methods converge to the species tree as the number of loci increases, but where each locus has only a constant number of sites?
• Answers: Roch & Warnow, Syst Biol, March 2015:
  – Strict molecular clock: Yes for some new methods, even for a single site per locus
  – No clock: Unknown for all methods, including MP-EST, ASTRAL, etc.
S. Roch and T. Warnow. "On the robustness to gene tree estimation error (or lack thereof) of coalescent-based species tree methods", Systematic Biology, 64(4):663-676, 2015
What is the impact of gene tree estimation error on species tree estimation?
Avian Phylogenomics Project Erich Jarvis, HHMI
Guojie Zhang, BGI
• Approx. 50 species, whole genomes • 14,000 loci • Multi-national team (100+ investigators) • 8 papers published in special issue of Science 2014
Biggest computational challenges: 1. Multi-million site maximum likelihood analysis (~300 CPU years, and 1Tb of distributed memory, at supercomputers around world) 2. Constructing “coalescent-based” species tree from 14,000 different gene trees
MTP Gilbert, Copenhagen
Siavash Mirarab, Texas
Tandy Warnow, Texas and UIUC
RESEARCH ARTICLE
Whole-genome analyses resolve early branches in the tree of life of modern birds
Erich D. Jarvis,1*† Siavash Mirarab,2* Andre J. Aberer,3 Bo Li,4,5,6 Peter Houde,7
Cai Li,4,6 Simon Y. W. Ho,8 Brant C. Faircloth,9,10 Benoit Nabholz,11
Jason T. Howard,1 Alexander Suh,12 Claudia C. Weber,12 Rute R. da Fonseca,6
Jianwen Li,4 Fang Zhang,4 Hui Li,4 Long Zhou,4 Nitish Narula,7,13 Liang Liu,14
Ganesh Ganapathy,1 Bastien Boussau,15 Md. Shamsuzzoha Bayzid,2
Volodymyr Zavidovych,1 Sankar Subramanian,16 Toni Gabaldón,17,18,19
Salvador Capella-Gutiérrez,17,18 Jaime Huerta-Cepas,17,18 Bhanu Rekepalli,20
Kasper Munch,21 Mikkel Schierup,21 Bent Lindow,6 Wesley C. Warren,22
David Ray,23,24,25 Richard E. Green,26 Michael W. Bruford,27 Xiangjiang Zhan,27,28
Andrew Dixon,29 Shengbin Li,30 Ning Li,31 Yinhua Huang,31
Elizabeth P. Derryberry,32,33 Mads Frost Bertelsen,34 Frederick H. Sheldon,33
Robb T. Brumfield,33 Claudio V. Mello,35,36 Peter V. Lovell,35 Morgan Wirthlin,35
Maria Paula Cruz Schneider,36,37 Francisco Prosdocimi,36,38 José Alfredo Samaniego,6
Amhed Missael Vargas Velazquez,6 Alonzo Alfaro-Núñez,6 Paula F. Campos,6
Bent Petersen,39 Thomas Sicheritz-Ponten,39 An Pas,40 Tom Bailey,41 Paul Scofield,42
Michael Bunce,43 David M. Lambert,16 Qi Zhou,44 Polina Perelman,45,46
Amy C. Driskell,47 Beth Shapiro,26 Zijun Xiong,4 Yongli Zeng,4 Shiping Liu,4
Zhenyu Li,4 Binghang Liu,4 Kui Wu,4 Jin Xiao,4 Xiong Yinqi,4 Qiuemei Zheng,4
Yong Zhang,4 Huanming Yang,48 Jian Wang,48 Linnea Smeds,12 Frank E. Rheindt,49
Michael Braun,50 Jon Fjeldsa,51 Ludovic Orlando,6 F. Keith Barker,52
Knud Andreas Jønsson,51,53,54 Warren Johnson,55 Klaus-Peter Koepfli,56
Stephen O’Brien,57,58 David Haussler,59 Oliver A. Ryder,60 Carsten Rahbek,51,54
Eske Willerslev,6 Gary R. Graves,51,61 Travis C. Glenn,62 John McCormack,63
Dave Burt,64 Hans Ellegren,12 Per Alström,65,66 Scott V. Edwards,67
Alexandros Stamatakis,3,68 David P. Mindell,69 Joel Cracraft,70 Edward L. Braun,71
Tandy Warnow,2,72† Wang Jun,48,73,74,75,76† M. Thomas P. Gilbert,6,43† Guojie Zhang4,77†
To better determine the history of modern birds, we performed a genome-scale phylogenetic analysis of 48 species representing all orders of Neoaves using phylogenomic methods created to handle genome-scale data. We recovered a highly resolved tree that confirms previously controversial sister or close relationships. We identified the first divergence in Neoaves, two groups we named Passerea and Columbea, representing independent lineages of diverse and convergently evolved land and water bird species. Among Passerea, we infer the common ancestor of core landbirds to have been an apex predator and confirm independent gains of vocal learning. Among Columbea, we identify pigeons and flamingoes as belonging to sister clades. Even with whole genomes, some of the earliest branches in Neoaves proved challenging to resolve, which was best explained by massive protein-coding sequence convergence and high levels of incomplete lineage sorting that occurred during a rapid radiation after the Cretaceous-Paleogene mass extinction event about 66 million years ago.
The diversification of species is not always gradual but can occur in rapid radiations, especially after major environmental changes (1, 2). Paleobiological (3–7) and molecular (8) evidence suggests that such “big bang” radiations occurred for neoavian birds (e.g., songbirds, parrots, pigeons, and others) and placental mammals, representing 95% of extant avian and mammalian species, after the Cretaceous to Paleogene (K-Pg) mass extinction event about 66 million years ago (Ma). However, other nuclear (9–12) and mitochondrial (13, 14) DNA studies propose an earlier, more gradual diversification, beginning within the Cretaceous 80 to 125 Ma. This debate is confounded by findings that different data sets (15–19) and analytical methods (20, 21) often yield contrasting species trees. Resolving such timing and phylogenetic relationships is important for comparative genomics, which can inform about human traits and diseases (22).
Recent avian studies based on fragments of 5 [~5000 base pairs (bp) (8)] and 19 [31,000 bp (17)] genes recovered some relationships inferred from morphological data (15, 23) and DNA-DNA hybridization (24), postulated new relationships, and contradicted many others. Consistent with most previous molecular and contemporary morphological studies (15), they divided modern birds (Neornithes) into Palaeognathae (tinamous and flightless ratites), Galloanseres [Galliformes (landfowl) and Anseriformes (waterfowl)], and Neoaves (all other extant birds). Within Neoaves,
1320 12 DECEMBER 2014 • VOL 346 ISSUE 6215 sciencemag.org SCIENCE
A FLOCK OF GENOMES
Jarvis, Mirarab, et al., examined 48 bird species using 14,000 loci from whole genomes. Two trees were presented.
1. A single dataset maximum likelihood concatenation analysis used ~300 CPU years and 1Tb of distributed memory, using TACC and other supercomputers around the world.
2. However, every locus had a different tree (suggestive of “incomplete lineage sorting”), and the noisy genome-scale data required the development of a new method, “statistical binning”.
Only 48 species, but heuristic ML took ~300 CPU years on multiple supercomputers and used 1Tb of memory
SCIENCE sciencemag.org 12 DECEMBER 2014 • VOL 346 ISSUE 6215 1337
RESEARCH ARTICLE SUMMARY
AVIAN GENOMICS
Statistical binning enables an accurate coalescent-based estimation of the avian tree
Siavash Mirarab, Md. Shamsuzzoha Bayzid, Bastien Boussau, Tandy Warnow*
INTRODUCTION: Reconstructing species trees for rapid radiations, as in the early diversification of birds, is complicated by biological processes such as incomplete lineage sorting (ILS) that can cause different parts of the genome to have different evolutionary histories. Statistical methods, based on the multispecies coalescent model and that combine gene trees, can be highly accurate even in the presence of massive ILS; however, these methods can produce species trees that are topologically far from the species tree when estimated gene trees have error. We have developed a statistical binning technique to address gene tree estimation error and have explored its use in genome-scale species tree estimation with MP-EST, a popular coalescent-based species tree estimation method.
RATIONALE: In statistical binning, phylogenetic trees on different genes are estimated and then placed into bins, so that the differences between trees in the same bin can be explained by estimation error (see the figure). A new tree is then estimated for each bin by applying maximum likelihood to a concatenated alignment of the multiple sequence alignments of its genes, and a species tree is estimated using a coalescent-based species tree method from these supergene trees.
RESULTS: Under realistic conditions in our simulation study, statistical binning reduced the topological error of species trees estimated using MP-EST and enabled a coalescent-based analysis that was more accurate than concatenation even when gene tree estimation error was relatively high. Statistical binning also reduced the error in gene tree topology and species tree branch length estimation, especially when the phylogenetic signal in gene sequence alignments was low. Species trees estimated using MP-EST with statistical binning on four biological data sets showed increased concordance with the biological literature. When MP-EST was used to analyze 14,446 gene trees in the avian phylogenomics project, it produced a species tree that was discordant with the concatenation analysis and conflicted with prior literature. However, the statistical binning analysis produced a tree that was highly congruent with the concatenation analysis and was consistent with the prior scientific literature.
CONCLUSIONS: Statistical binning reduces the error in species tree topology and branch length estimation because it reduces gene tree estimation error. These improvements are greatest when gene trees have reduced bootstrap support, which was the case for the avian phylogenomics project. Because using unbinned gene trees can result in overestimation of ILS, statistical binning may be helpful in providing more accurate estimations of ILS levels in biological data sets. Thus, statistical binning enables highly accurate species tree estimations, even on genome-scale data sets.
[Figure: The statistical binning pipeline for estimating species trees from gene trees. Loci are grouped into bins based on a statistical test for combinability, before estimating gene trees. Panel labels: Statistical binning technique; Statistical binning pipeline; Traditional pipeline (unbinned); Sequence data; Incompatibility graph; Gene alignments; Binned supergene alignments; Estimated gene trees; Supergene trees; Species tree.]
The list of author affiliations is available in the full article online.
*Corresponding author. E-mail: [email protected]
Cite this article as S. Mirarab et al., Science 346, 1250463 (2014). DOI: 10.1126/science.1250463
Read the full article at http://dx.doi.org/10.1126/science.1250463
Published by AAAS
We used 100 CPU years (mostly on TACC) to develop and test this method.
Ideas behind statistical binning
Number of sites in an alignment
• “Gene tree” error tends to decrease with the number of sites in the alignment
• Concatenation (even if not statistically consistent) tends to be reasonably accurate when there is not too much gene tree heterogeneity
Note: Supergene trees computed using fully partitioned maximum likelihood.
Vertex-coloring a graph with balanced color classes is NP-hard; we used a heuristic.
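To make the coloring step concrete, here is a generic greedy sketch. It is not the heuristic used in the paper, which the slides do not describe: it simply assumes that genes are vertices, that an edge joins two genes whose trees conflict with high support, and that each color class becomes a bin. Vertices are processed in order of decreasing degree and placed in the smallest conflict-free bin. All names are hypothetical.

```python
def greedy_balanced_bins(genes, incompatible_pairs):
    """Greedily color an incompatibility graph, preferring small color classes.

    genes: list of gene identifiers.
    incompatible_pairs: iterable of (gene, gene) pairs that must not share a bin.
    Returns a list of sets, one per bin.
    """
    neighbors = {g: set() for g in genes}
    for a, b in incompatible_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    bins = []
    # Handle the most constrained genes first (highest degree).
    for gene in sorted(genes, key=lambda g: -len(neighbors[g])):
        # Bins containing no gene that conflicts with this one.
        candidates = [b for b in bins if not (neighbors[gene] & b)]
        if candidates:
            # Choosing the smallest bin keeps the color classes balanced.
            min(candidates, key=len).add(gene)
        else:
            bins.append({gene})
    return bins


if __name__ == "__main__":
    print(greedy_balanced_bins(["g1", "g2", "g3", "g4"], [("g1", "g2")]))
```

A greedy pass like this gives no optimality guarantee on the number or balance of bins, which is consistent with the problem being NP-hard.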
Statistical binning vs. unbinned
Datasets: 11-taxon strong ILS datasets with 50 genes from Chung and Ané, Systematic Biology
Binning produces bins with approximately 5 to 7 genes each
[Figure: average FN rate (y-axis, 0 to 0.25) for MP-EST, MDC*(75), MRP, MRL, and GC; series: Unbinned, Statistical-75]
Theorem 3 (PLOS One, Bayzid et al. 2015): Unweighted statistical binning pipelines are not statistically consistent under GTR+MSC
As the number of sites per locus increases:
• All estimated gene trees converge to the true gene tree and have bootstrap support that converges to 1 (Steel 2014)
• For each bin, with probability converging to 1, the genes in the bin have the same tree topology (but can have different numeric parameters), and there is only one bin for any given tree topology
• For each bin, a fully partitioned maximum likelihood (ML) analysis of its supergene alignment converges to a tree with the common gene tree topology.
As the number of loci increases:
• every gene tree topology appears with probability converging to 1.
Hence as both the number of loci and number of sites per locus increase, with probability converging to 1, every gene tree topology appears exactly once in the set of supergene trees. It is impossible to infer the species tree from the flat distribution of gene trees!
Theorem 2 (PLOS One, Bayzid et al. 2015): WSB pipelines are statistically consistent under GTR+MSC
Easy proof: As the number of sites per locus increases:
• All estimated gene trees converge to the true gene tree and have bootstrap support that converges to 1 (Steel 2014)
• For every bin, with probability converging to 1, the genes in the bin have the same tree topology
• Fully partitioned GTR ML analysis of each bin converges to a tree with the common topology of the genes in the bin
Hence as the number of sites per locus and number of loci both increase, WSB followed by a statistically consistent summary method will converge in probability to the true species tree. Q.E.D.
Weighted Statistical Binning: empirical
WSB generally benign to highly beneficial:
• Improves accuracy of gene tree topology
• Improves accuracy of species tree topology
• Improves accuracy of species tree branch length
• Reduces incidence of highly supported false positive branches
Statistical binning vs. Unbinned and Concatenation
[Figure panels: (a) MP-EST on varying gene sequence length; (b) ASTRAL on varying gene sequence length; (c) MP-EST on varying numbers of genes; (d) MP-EST on varying levels of ILS]
Species tree estimation error for MP-EST and ASTRAL, and also concatenation using ML, on avian simulated datasets: 48 taxa, moderately high ILS (AD=47%), 1000 genes, and varying gene sequence length.
Bayzid et al. (2015). PLoS ONE 10(6):e0129183
Comparing Binned and Un-binned MP-EST on the Avian Dataset
[Figure: avian species trees from Binned MP-EST (branch support shown as unweighted/weighted) and Unbinned MP-EST, with labeled clades Cursores, Columbea, Otidimorphae, and Australaves, and a legend entry "Conflict with other lines of strong evidence"]
Unbinned MP-EST strongly rejects Columbea, a major finding by Jarvis, Mirarab, et al. Binned MP-EST is largely consistent with the ML concatenation analysis. The trees presented in Science 2014 were the ML concatenation and Binned MP-EST.
Running Time Comparison
• Concatenation analysis of the Avian dataset:
  – ~250 CPU years and 1 Tb memory
• Statistical binning analysis:
  – ~5 CPU years, almost all of which was computing maximum likelihood gene trees, much less memory usage
Species tree estimation using traditional approaches is more computationally expensive, and not as accurate as coalescent-based methods!
Summary (so far)
• Statistical binning (weighted or unweighted): improves gene trees, and leads to improved species trees in the presence of ILS compared to unbinned analyses.
• Statistical binning pipelines are also more accurate than concatenation under high ILS.
• Pipelines using the weighted version are statistically consistent under the multi-species coalescent model.
• Statistical binning pipelines are much faster than concatenation analyses (e.g. ~5 CPU years vs. ~250 CPU years for the avian dataset).
1KP: Thousand Transcriptome Project
• 103 plant transcriptomes, 400-800 single copy “genes”
• Wickett, Mirarab et al., PNAS 2014
• Next phase will be much bigger (~1000 species and ~1000 genes)
G. Ka-Shu Wong (U Alberta), N. Wickett (Northwestern), J. Leebens-Mack (U Georgia), N. Matasci (iPlant), T. Warnow, S. Mirarab, N. Nguyen (UT-Austin)
Challenges:
• Massive gene tree heterogeneity consistent with ILS
• Could not use existing coalescent methods due to missing data (many gene trees could not be rooted) and large number of species
Solution:
• New coalescent-based method ASTRAL (Mirarab et al., ECCB/Bioinformatics 2014, Mirarab et al., ISMB/Bioinformatics 2015)
• ASTRAL is statistically consistent, polynomial time, and uses unrooted gene trees.
ASTRAL [Mirarab, et al., ECCB/Bioinformatics, 2014]
• Optimization Problem (NP-Hard): Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees
$$\mathrm{Score}(T) = \sum_{t \in \mathcal{T}} \left| Q(T) \cap Q(t) \right|$$
where $Q(T)$ is the set of quartet trees induced by $T$, $t$ is a gene tree, and $\mathcal{T}$ is the set of all input gene trees.
• Theorem: Statistically consistent under the multi-species coalescent model when solved exactly
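The score can be evaluated by brute force on small inputs, which may help make the definition concrete. This is an illustrative sketch only (ASTRAL itself does not enumerate quartets this way): trees are written as nested Python tuples of leaf names, which is an assumed representation, and the helper names `clades`, `quartet_trees`, and `score` are hypothetical.

```python
from itertools import combinations

def clades(tree):
    """Leaf sets below every internal node of a nested-tuple tree."""
    out = []
    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        out.append(leaves)
        return leaves
    walk(tree)
    return out

def quartet_trees(tree):
    """Q(tree): resolved quartet topologies, each stored as {pair, pair}."""
    cl = clades(tree)
    taxa = sorted(max(cl, key=len))  # the largest clade is the full leaf set
    result = set()
    for quartet in combinations(taxa, 4):
        q = frozenset(quartet)
        for c in cl:
            side = c & q
            if len(side) == 2:  # this clade separates the quartet 2 vs 2
                result.add(frozenset([side, q - side]))
                break
    return result

def score(species_tree, gene_trees):
    """Sum over gene trees of the number of shared induced quartet trees."""
    q_species = quartet_trees(species_tree)
    return sum(len(q_species & quartet_trees(t)) for t in gene_trees)

# Hypothetical 5-taxon example.
species = ((("A", "B"), "C"), ("D", "E"))
gene_1 = ((("A", "B"), "C"), ("D", "E"))
gene_2 = ((("A", "C"), "B"), ("D", "E"))
total = score(species, [gene_1, gene_2])
```

Here the first gene tree shares all 5 quartet trees with the candidate species tree and the second shares 3, so the total is 8. The enumeration visits every 4-subset of taxa and so is only practical for very small trees.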
Constrained Maximum Quartet Support Tree
• Input: Set 𝒯 = {t1, t2, …, tk} of unrooted gene trees, with each tree on set S with n species, and set X of allowed bipartitions
• Output: Unrooted tree T on leaf set S, maximizing the total quartet tree similarity to 𝒯, subject to T drawing its bipartitions from X.
Theorems (Mirarab et al., 2014):
• If X contains the bipartitions from the input gene trees (and perhaps others), then an exact solution to this problem is statistically consistent under the MSC.
• The constrained MQST problem can be solved in O(|X|^2 nk) time. (We use dynamic programming, and build the unrooted tree from the bottom-up, based on “allowed clades” – halves of the allowed bipartitions.)
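As an illustration of the default constraint set, the sketch below collects the nontrivial bipartitions of the input gene trees into X and derives the "allowed clades" as the halves of those bipartitions. It is a minimal example under assumptions: trees are nested tuples of leaf names on a common leaf set, and the function names are hypothetical; it does not implement the dynamic program itself.

```python
def leaf_sets(node, out):
    """Append the leaf set of every internal node to out; return this node's leaves."""
    if isinstance(node, str):
        return frozenset([node])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in node))
    out.append(leaves)
    return leaves

def bipartitions(tree):
    """Nontrivial bipartitions of a tree, each stored as {side, complement}."""
    out = []
    taxa = leaf_sets(tree, out)
    return {frozenset([side, taxa - side]) for side in out
            if 2 <= len(side) <= len(taxa) - 2}

def build_constraint_set(gene_trees):
    """X: the union of bipartitions over all input gene trees."""
    X = set()
    for t in gene_trees:
        X |= bipartitions(t)
    return X

# Hypothetical input gene trees on the leaf set {A, B, C, D, E}.
gene_trees = [((("A", "B"), "C"), ("D", "E")),
              ((("A", "C"), "B"), ("D", "E"))]
X = build_constraint_set(gene_trees)
allowed_clades = {side for bip in X for side in bip}
```

For this input X holds three bipartitions (AB|CDE, AC|BDE, ABC|DE) and therefore six allowed clades.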
Simulation study
• Variable parameters:
  • Number of species: 10 – 1000
  • Number of genes: 50 – 1000
  • Amount of ILS: low, medium, high
  • Deep versus recent speciation
• 11 model conditions (50 replicates each) with heterogeneous gene tree error
• Compare to NJst, MP-EST, concatenation (CA-ML)
• Evaluate accuracy using FN rate: the percentage of branches in the true tree that are missing from the estimated tree
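The FN rate defined in the last bullet can be computed by comparing bipartitions (internal branches) of the two trees. The following is a small sketch under stated assumptions: trees are nested tuples of leaf names on the same leaf set, the result is returned as a fraction rather than a percentage, and the helper names are hypothetical.

```python
def leaf_sets(node, out):
    """Append the leaf set of every internal node to out; return this node's leaves."""
    if isinstance(node, str):
        return frozenset([node])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in node))
    out.append(leaves)
    return leaves

def bipartitions(tree):
    """Nontrivial bipartitions (internal branches), each stored as {side, complement}."""
    out = []
    taxa = leaf_sets(tree, out)
    return {frozenset([side, taxa - side]) for side in out
            if 2 <= len(side) <= len(taxa) - 2}

def fn_rate(true_tree, estimated_tree):
    """Fraction of true-tree branches that are missing from the estimated tree."""
    true_b = bipartitions(true_tree)
    if not true_b:
        return 0.0  # a tree with no internal branches has nothing to miss
    return len(true_b - bipartitions(estimated_tree)) / len(true_b)

# Hypothetical example: the estimate recovers ABC|DE but not AB|CDE.
true_tree = ((("A", "B"), "C"), ("D", "E"))
estimated = ((("A", "C"), "B"), ("D", "E"))
rate = fn_rate(true_tree, estimated)
```

In this example one of the two internal branches of the true tree is missing, so the rate is 0.5 (50%).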
[Figure: simulation pipeline: true (model) species tree → true gene trees → sequence data → estimated gene trees → estimated species tree; example taxa: Finch, Falcon, Owl, Eagle, Pigeon]
ASTRAL-II
look at all pairs of leaves chosen each from one of the children of u. For each such pair of leaves, there are $\binom{u'}{2}$ quartet trees that put that pair together, where $u'$ is the number of leaves outside the node u. This will examine each pair of leaves in each of the k input gene trees exactly once and would therefore require $O(n^2 k)$ computations. The final score can be normalized by the maximum number of input quartet trees that include a pair of taxa.
Given the similarity matrix, we calculate a UPGMA tree and add all its bipartitions to the set X. This heuristic adds relatively few bipartitions, but the matrix is used in the next heuristic, which is our main addition mechanism.
Greedy: We estimate the greedy consensus of the gene trees at different threshold levels (0, 1/100, 2/100, 5/100, 1/10, 1/4, 1/3). For each polytomy in each greedy consensus tree, we resolve the polytomy in multiple ways and add bipartitions implied by those resolutions to the set X. First, we resolve the polytomy by applying the UPGMA algorithm to the similarity matrix, starting from the clades given by the polytomy. Then, we sample one taxon from each side of the polytomy randomly, and use the greedy consensus of the gene trees restricted to this subsample to find a resolution of the polytomy (we randomly resolve any multifurcations in this greedy consensus on the induced subsample). We repeat this process at least 10 times, but if the subsampled greedy consensus trees include sufficiently frequent bipartitions (defined as > 1%), we do more rounds of random sampling (we increase the number of iterations by two every time this happens). For each random subsample around a polytomy, we also resolve it by calculating a UPGMA tree on the subsampled similarity matrix. Finally, for the first two greedy threshold values and the first 10 random subsamples, we also use a third strategy that can potentially add a larger number of bipartitions. For each subsampled taxon x, we resolve the polytomy as a caterpillar tree by sorting the remaining taxa according to their similarity with x.
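The greedy consensus that this heuristic starts from can be sketched as follows. This is a simplified illustration rather than the ASTRAL-II code: it assumes gene trees given as nested tuples on a common leaf set, treats the threshold as a strict lower bound on bipartition frequency, returns the selected bipartitions instead of building a tree, and omits the polytomy-resolution steps; all function names are hypothetical.

```python
from collections import Counter

def leaf_sets(node, out):
    """Append the leaf set of every internal node to out; return this node's leaves."""
    if isinstance(node, str):
        return frozenset([node])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in node))
    out.append(leaves)
    return leaves

def bipartitions(tree):
    """Nontrivial bipartitions of a tree, each stored as {side, complement}."""
    out = []
    taxa = leaf_sets(tree, out)
    return {frozenset([side, taxa - side]) for side in out
            if 2 <= len(side) <= len(taxa) - 2}

def compatible(b1, b2):
    """Two bipartitions can coexist in a tree iff some pair of sides is disjoint."""
    return any(not (s1 & s2) for s1 in b1 for s2 in b2)

def greedy_consensus(gene_trees, threshold=0.0):
    """Add bipartitions in decreasing frequency order, skipping conflicts."""
    counts = Counter(b for t in gene_trees for b in bipartitions(t))
    k = len(gene_trees)
    ranked = sorted(counts.items(),
                    key=lambda kv: (-kv[1], sorted(sorted(s) for s in kv[0])))
    chosen = []
    for bip, count in ranked:
        if count / k > threshold and all(compatible(bip, c) for c in chosen):
            chosen.append(bip)
    return chosen

# Hypothetical input: two copies of one topology and one conflicting tree.
g1 = ((("A", "B"), "C"), ("D", "E"))
g2 = ((("A", "C"), "B"), ("D", "E"))
loose = greedy_consensus([g1, g1, g2], threshold=0.0)
strict = greedy_consensus([g1, g1, g2], threshold=0.9)
```

With threshold 0 the result keeps ABC|DE and AB|CDE and rejects the conflicting AC|BDE; with threshold 0.9 only ABC|DE survives, leaving a polytomy of the kind that the resolution strategies in the paragraph above are meant to handle.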
Gene tree polytomies: When gene trees include polytomies, we also add new bipartitions to set X. We first compute the greedy consensus of the input gene trees with threshold 0 and if the greedy consensus has polytomies, we resolve them using UPGMA; we repeat this process twice to account for uncertainty in greedy consensus estimation. Then, for each gene tree polytomy, we use the two resolved consensus trees to infer a resolution of the polytomy and we add the implied resolutions to set X.
3.3 Multi-furcating input gene trees
[Figure panels: (a) True gene tree discordance: density vs. RF distance (true species tree vs. true gene trees), by rate (1e−06, 1e−07) and tree height (10M, 2M, 500K); (b) Gene tree estimation error: density vs. RF distance (true vs. estimated)]
Fig. 1. Characteristics of the simulation. (a) RF distance between the true species tree and the true gene trees (50 replicates of 1000 genes) for Dataset I. Tree height directly affects the amount of true discordance; the speciation rate affects true gene tree discordance only with 10M tree length. (b) RF distance between true gene trees and estimated gene trees for Dataset I. See also Figure S1 for inter and intra-replicate gene tree error distributions.
Extending ASTRAL to inputs that include polytomies requires solving the weighted quartet tree problem when each node of the input defines not a tripartition, but a multi-partition of the set of taxa. We start by a basic observation: every resolved quartet tree induced by a gene tree maps to two nodes in the gene tree regardless of whether the gene tree is binary or not. In other words, induced quartet trees that map to only one node of the gene tree are unresolved. When maximizing the quartet support, these unresolved gene tree quartet trees are inconsequential and need to be ignored. Now, consider a polytomy of degree d. There are $\binom{d}{3}$ ways to select three sides of the polytomy. Each of these ways of selecting three sides defines a tripartition of a subset of taxa. Any selection of two taxa from one side of this tripartition and one taxon from each of the remaining two sides still defines an induced resolved quartet tree, and each induced resolved quartet tree would still map to exactly two nodes in our multi-furcating tree. Thus, all the algorithmic assumptions of ASTRAL remain intact, as long as for each multi-furcating node in an input gene tree, we treat it as a collection of $\binom{d}{3}$ tripartitions. Note that in the presence of polytomies, the running time analysis can change because analyzing each multi-furcating node requires time cubic in its degree and the degree can increase in principle with n. Thus, the running time depends on the patterns of the multi-furcations and cannot be studied in a general case.
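The treatment of a degree-d node as a collection of tripartitions can be sketched directly. In this minimal example the sides of the polytomy are given as disjoint sets of leaf names (an assumed input format), and the function names `polytomy_tripartitions` and `quartets_defined` are hypothetical.

```python
from itertools import combinations
from math import comb

def polytomy_tripartitions(sides):
    """All ways to select three sides of a node with d sides: C(d, 3) tripartitions."""
    return [tuple(frozenset(s) for s in triple)
            for triple in combinations(sides, 3)]

def quartets_defined(tripartition):
    """Selections of two taxa from one side and one taxon from each other side."""
    x, y, z = (len(side) for side in tripartition)
    return (comb(x, 2) * y * z) + (comb(y, 2) * x * z) + (comb(z, 2) * x * y)

# Hypothetical polytomy of degree 4 on six taxa.
sides = [{"A", "B"}, {"C"}, {"D", "E"}, {"F"}]
trips = polytomy_tripartitions(sides)
```

Enumerating the triples costs time cubic in the degree of the node, which matches the running-time remark above; a binary node (d = 3) yields exactly one tripartition.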
Statistical Consistency: ASTRAL-I was statistically consistent, and changes from ASTRAL-I to ASTRAL-II either affect running time, or enlarge the search space, which does not negate consistency.
Theorem 3: ASTRAL-II is statistically consistent for binary complete input gene trees.
4 EXPERIMENTAL SETUP
Simulation Procedure: We used SimPhy, a tool developed by Mallo et al. (2015), to simulate species trees and gene trees (produced in mutation units), and then used Indelible to simulate sequences down the gene trees with varying length and model parameters. We estimated gene trees on these simulated gene alignments, which we then used in coalescent-based analyses.
We simulated 10 model conditions, which we divide into two datasets, with one model condition appearing in both datasets. We
Used SimPhy, Mallo and Posada, 2015
Tree accuracy when varying the number of species
1000 genes, “medium” levels of recent ILS
[Figure: species tree topological error (FN), 4% to 16%, vs. number of species (10, 50, 100, 200, 500, 1000), for ASTRAL-II and MP-EST]
Accuracy in the presence of HGT + ILS
200 Estimated Gene Trees
Data: Fixed, moderate ILS rate, 50 replicates per HGT rates (1)-(6), 1 model species tree per replicate on 51 taxa, 1000 true gene trees, simulated 1000 bp gene sequences using INDELible [8], 1000 gene trees estimated from GTR simulated sequences using FastTree-2 [7]
[7] Price, Dehal, Arkin 2015
[8] Fletcher, Yang 2009
Davidson et al., RECOMB-CG, BMC Genomics 2015
Summary
• ASTRAL is a summary method that is statistically consistent in the presence of ILS, and that runs in polynomial time. ASTRAL can analyze very large datasets (1000 species and 1000 genes – or more) with high accuracy.
• Coalescent-based summary methods are much faster than traditional concatenation approaches, and they can provide improved accuracy in the presence of gene tree heterogeneity.
• Gene tree estimation error impacts accuracy of species trees – but statistical binning can reduce gene tree estimation error, and lead to improved species tree estimations (topology, branch lengths, and incidence of false positives).
Future Directions
• Better coalescent-based summary methods (that are more robust to gene tree estimation error)
• Better techniques for estimating gene trees given multi-locus data, or for co-estimating gene trees and species trees
• Better theory about robustness to gene tree estimation error (or lack thereof) for coalescent-based summary methods
• Better “single site” methods (see SMRT, SVDquartets, METAL, and SNAPP)
Big Phylogenomic Data
• Thousands to millions of sequences and species, millions of sites per species
• Big phylogenomic data are not the same as the usual data
• Relative performance of methods changes with dataset size and heterogeneity
• Even moderate-sized inputs can create huge outputs
  – We need new methods, new optimization problems, new statistical models, new statistical theory, compression methods, visualization methods, ….
Acknowledgments
NSF grant DBI-1461364 (joint with Noah Rosenberg at Stanford and Luay Nakhleh at Rice): http://tandy.cs.illinois.edu/PhylogenomicsProject.html
Papers available at http://tandy.cs.illinois.edu/papers.html
Software ASTRAL and statistical binning: Available at https://github.com/smirarab
Others at http://tandy.cs.illinois.edu/software.html
Other Funding: David Bruton Jr. Centennial Professorship, TACC (Texas Advanced Computing Center), Grainger Foundation, and HHMI (to SM)