
Constrained Exact Optimization in Phylogenetics

Tandy Warnow, The University of Illinois at Urbana-Champaign

Orangutan Gorilla Chimpanzee Human

From the Tree of the Life Website, University of Arizona

Phylogeny (evolutionary tree)

DNA Sequence Evolution

AAGACTT

TGGACTT AAGGCCT

-3 mil yrs

-2 mil yrs

-1 mil yrs

today

AGGGCAT TAGCCCT AGCACTT

AAGGCCT TGGACTT

TAGCCCA TAGACTT AGCGCTT AGCACAA AGGGCAT


AAGACTT

TGGACTT AAGGCCT


AAGGCCT TGGACTT

AGCGCTT AGCACAA TAGACTT TAGCCCA AGGGCAT

Phylogeny Problem

TAGCCCA TAGACTT TGCACAA TGCGCTT AGTGCAT

U V W X Y

U

V W

X

Y

Phylogeny + genomics = genome-scale phylogeny estimation.

. . .

Analyze separately

Summary Method

Main competing approaches: gene 1, gene 2, . . . , gene k

. . . Concatenation

Species

Phylogenomic pipeline

•  Select taxon set and markers

•  Gather and screen sequence data, possibly identify orthologs

•  Compute multiple sequence alignment and “gene tree” for each locus

•  Compute species tree or phylogenetic network from the gene trees or alignments

•  Get statistical support on each branch (e.g., bootstrapping)

•  Estimate dates on the nodes of the phylogeny

•  Use species tree with branch support and dates to understand biology

Just about everything worth doing is NP-hard

•  Multiple sequence alignment
•  Maximum likelihood gene tree estimation
•  Species tree estimation by combining gene trees
•  Supertree estimation
– and hence divide-and-conquer strategies

And Bayesian methods are even more computationally intensive

3.8 Branch-swapping methods

(a) Phylogenetic tree (b) Subtree prune. . . (c) and regraft

Figure 3.15 Subtree prune and regraft (SPR). (a) An unrooted phylogenetic tree T. The subtree consisting of three edges and the two leaves labeled a and b is pruned at the location marked “∗” to produce the two subtrees shown in (b), and then reattached into the edge leading to the leaf labeled f (c).

(a) Tree bisection... (b) reconnection choice... (c) and reconnection

Figure 3.16 Tree bisection and reconnection (TBR). The phylogenetic tree T shown in (a) is bisected into two subtrees by deleting the edge marked “∗”. As illustrated in (b), one possible choice is to reattach the two subtrees by a new edge joining the middles of the two leaf edges associated with taxa b and f. The resulting new phylogenetic tree is shown in (c).

The most general branch-swapping method considered here is called tree bisection and reconnection (TBR). To apply this method to a phylogenetic tree, first bisect the tree into two subtrees by completely removing an edge from the tree. Then, to reattach the tree, join the two subtrees by a new edge that leads from somewhere in the one part to somewhere in the other; see Figure 3.16.

For a bifurcating phylogenetic tree T on X, let NNI(T), SPR(T) and TBR(T) denote the set of all phylogenetic trees that can be obtained by applying exactly one NNI, SPR or TBR operation to T, respectively. It is not too difficult to see that the following inclusions hold:

NNI(T) ⊆ SPR(T) ⊆ TBR(T). (3.9)

Let n denote the number of taxa. It can be shown that the cardinalities of NNI(T) and SPR(T) depend only on n. While the size of NNI(T) grows linearly with n, the size of SPR(T) grows quadratically. The size of TBR(T) depends not only on n, but also on the actual topology of the phylogenetic trees considered.
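The growth rates mentioned above can be made concrete with a small sketch. It assumes the standard closed forms for unrooted binary trees, 2(n-3) NNI neighbors and 2(n-3)(2n-7) SPR neighbors; these formulas are not stated in the excerpt itself, and the function names are illustrative only.

```python
def nni_neighborhood_size(n):
    """Number of distinct trees one NNI move away from an unrooted
    binary tree on n >= 4 leaves (assumed closed form: 2(n-3))."""
    if n < 4:
        raise ValueError("need at least 4 leaves")
    return 2 * (n - 3)


def spr_neighborhood_size(n):
    """Number of distinct trees one SPR move away from an unrooted
    binary tree on n >= 4 leaves (assumed closed form: 2(n-3)(2n-7))."""
    if n < 4:
        raise ValueError("need at least 4 leaves")
    return 2 * (n - 3) * (2 * n - 7)


if __name__ == "__main__":
    for n in (4, 10, 100, 1000):
        print(n, nni_neighborhood_size(n), spr_neighborhood_size(n))
```

For n = 4 both functions return 2, which agrees with there being only three unrooted binary trees on four leaves.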

Exercise 3.8.1 (Space of trees is connected) Show that the space of all bifurcating phylogenetic trees (without edge lengths) on n taxa is connected in the sense that any one such tree T1 can be transformed into a second one, T2, by a finite series of NNI (and thus also SPR or TBR) operations.

Standard heuristic search: TBR moves through “tree space”
Randomness to exit local optima

But there are (2n-5)!! trees on n leaves

Local search heuristics

Figure from Huson 2010

Solving maximum likelihood (and other hard optimization problems) is… unlikely

# of Taxa    # of Unrooted Trees
4            3
5            15
6            105
7            945
8            10395
9            135135
10           2027025
20           2.2 x 10^20
100          4.5 x 10^190
1000         2.7 x 10^2900
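The (2n-5)!! count of unrooted binary trees can be evaluated directly. The following is a minimal sketch (the function name is an illustrative choice) that reproduces the exact small-n entries of the table using Python's arbitrary-precision integers.

```python
def num_unrooted_trees(n):
    """Return (2n-5)!! = 1 * 3 * 5 * ... * (2n-5), the number of
    unrooted binary trees on n >= 3 labeled leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count


if __name__ == "__main__":
    for n in range(4, 11):
        print(n, num_unrooted_trees(n))
```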

Avian Phylogenomics Project

Erich Jarvis, HHMI
Guojie Zhang, BGI
MTP Gilbert, Copenhagen
Siavash Mirarab, Texas
Tandy Warnow, Texas and UIUC

•  Approx. 50 species, whole genomes
•  14,000 loci
•  Multi-national team (100+ investigators)
•  8 papers published in special issue of Science 2014

Biggest computational challenges:
1. Multi-million site maximum likelihood analysis (~300 CPU years, and 1Tb of distributed memory, at supercomputers around the world)
2. Constructing “coalescent-based” species tree from 14,000 different gene trees


RESEARCH ARTICLE

Whole-genome analyses resolve early branches in the tree of life of modern birds

Erich D. Jarvis,1*† Siavash Mirarab,2* Andre J. Aberer,3 Bo Li,4,5,6 Peter Houde,7

Cai Li,4,6 Simon Y. W. Ho,8 Brant C. Faircloth,9,10 Benoit Nabholz,11

Jason T. Howard,1 Alexander Suh,12 Claudia C. Weber,12 Rute R. da Fonseca,6

Jianwen Li,4 Fang Zhang,4 Hui Li,4 Long Zhou,4 Nitish Narula,7,13 Liang Liu,14

Ganesh Ganapathy,1 Bastien Boussau,15 Md. Shamsuzzoha Bayzid,2

Volodymyr Zavidovych,1 Sankar Subramanian,16 Toni Gabaldón,17,18,19

Salvador Capella-Gutiérrez,17,18 Jaime Huerta-Cepas,17,18 Bhanu Rekepalli,20

Kasper Munch,21 Mikkel Schierup,21 Bent Lindow,6 Wesley C. Warren,22

David Ray,23,24,25 Richard E. Green,26 Michael W. Bruford,27 Xiangjiang Zhan,27,28

Andrew Dixon,29 Shengbin Li,30 Ning Li,31 Yinhua Huang,31

Elizabeth P. Derryberry,32,33 Mads Frost Bertelsen,34 Frederick H. Sheldon,33

Robb T. Brumfield,33 Claudio V. Mello,35,36 Peter V. Lovell,35 Morgan Wirthlin,35

Maria Paula Cruz Schneider,36,37 Francisco Prosdocimi,36,38 José Alfredo Samaniego,6

Amhed Missael Vargas Velazquez,6 Alonzo Alfaro-Núñez,6 Paula F. Campos,6

Bent Petersen,39 Thomas Sicheritz-Ponten,39 An Pas,40 Tom Bailey,41 Paul Scofield,42

Michael Bunce,43 David M. Lambert,16 Qi Zhou,44 Polina Perelman,45,46

Amy C. Driskell,47 Beth Shapiro,26 Zijun Xiong,4 Yongli Zeng,4 Shiping Liu,4

Zhenyu Li,4 Binghang Liu,4 Kui Wu,4 Jin Xiao,4 Xiong Yinqi,4 Qiuemei Zheng,4

Yong Zhang,4 Huanming Yang,48 Jian Wang,48 Linnea Smeds,12 Frank E. Rheindt,49

Michael Braun,50 Jon Fjeldsa,51 Ludovic Orlando,6 F. Keith Barker,52

Knud Andreas Jønsson,51,53,54 Warren Johnson,55 Klaus-Peter Koepfli,56

Stephen O’Brien,57,58 David Haussler,59 Oliver A. Ryder,60 Carsten Rahbek,51,54

Eske Willerslev,6 Gary R. Graves,51,61 Travis C. Glenn,62 John McCormack,63

Dave Burt,64 Hans Ellegren,12 Per Alström,65,66 Scott V. Edwards,67

Alexandros Stamatakis,3,68 David P. Mindell,69 Joel Cracraft,70 Edward L. Braun,71

Tandy Warnow,2,72† Wang Jun,48,73,74,75,76† M. Thomas P. Gilbert,6,43† Guojie Zhang4,77†

To better determine the history of modern birds, we performed a genome-scale phylogenetic analysis of 48 species representing all orders of Neoaves using phylogenomic methods created to handle genome-scale data. We recovered a highly resolved tree that confirms previously controversial sister or close relationships. We identified the first divergence in Neoaves, two groups we named Passerea and Columbea, representing independent lineages of diverse and convergently evolved land and water bird species. Among Passerea, we infer the common ancestor of core landbirds to have been an apex predator and confirm independent gains of vocal learning. Among Columbea, we identify pigeons and flamingoes as belonging to sister clades. Even with whole genomes, some of the earliest branches in Neoaves proved challenging to resolve, which was best explained by massive protein-coding sequence convergence and high levels of incomplete lineage sorting that occurred during a rapid radiation after the Cretaceous-Paleogene mass extinction event about 66 million years ago.

The diversification of species is not always gradual but can occur in rapid radiations, especially after major environmental changes (1, 2). Paleobiological (3–7) and molecular (8) evidence suggests that such “big bang” radiations occurred for neoavian birds (e.g., songbirds, parrots, pigeons, and others) and placental mammals, representing 95% of extant avian and mammalian species, after the Cretaceous to Paleogene (K-Pg) mass extinction event about 66 million years ago (Ma). However, other nuclear (9–12) and mitochondrial (13, 14) DNA studies propose an earlier, more gradual diversification, beginning within the Cretaceous 80 to 125 Ma. This debate is confounded by findings that different data sets (15–19) and analytical methods (20, 21) often yield contrasting species trees. Resolving such timing and phylogenetic relationships is important for comparative genomics, which can inform about human traits and diseases (22).

Recent avian studies based on fragments of 5 [~5000 base pairs (bp) (8)] and 19 [31,000 bp (17)] genes recovered some relationships inferred from morphological data (15, 23) and DNA-DNA hybridization (24), postulated new relationships, and contradicted many others. Consistent with most previous molecular and contemporary morphological studies (15), they divided modern birds (Neornithes) into Palaeognathae (tinamous and flightless ratites), Galloanseres [Galliformes (landfowl) and Anseriformes (waterfowl)], and Neoaves (all other extant birds). Within Neoaves,

1320 12 DECEMBER 2014 • VOL 346 ISSUE 6215 sciencemag.org SCIENCE

A FLOCK OF GENOMES

Jarvis, Mirarab, et al., examined 48 bird species using 14,000 loci from whole genomes. Two trees were presented.

1. A single dataset maximum likelihood concatenation analysis used ~300 CPU years and 1Tb of distributed memory, using TACC and other supercomputers around the world.

2. However, every locus had a different tree (suggestive of “incomplete lineage sorting”), and the noisy genome-scale data required the development of a new method, “statistical binning”.

Only 48 species, but heuristic ML took ~300 CPU years on multiple supercomputers and used 1Tb of memory

1KP: Thousand Transcriptome Project

•  103 plant transcriptomes, 400-800 single copy “genes”
•  Wickett, Mirarab et al., PNAS 2014
•  Next phase will be much bigger
    •  ~1000 species and ~1000 genes,
    •  and will require the inference of multiple sequence alignments and trees on more than 100,000 sequences

G. Ka-Shu Wong, U Alberta
N. Wickett, Northwestern
J. Leebens-Mack, U Georgia
N. Matasci, iPlant
T. Warnow, UIUC
S. Mirarab, UCSD
N. Nguyen, UCSD

And many others


Genomic data: challenges and opportunities

Muir et al. [2016]


This Talk

•  Model-based tree estimation and NP-hard problems

•  How divide-and-conquer can improve tree estimation

•  Constrained optimization

•  Open problems

Markov Model of Site Evolution

Simplest (Jukes-Cantor, 1969):
•  The model tree T is binary and has substitution probabilities p(e) on each edge e.
•  The state at the root is randomly drawn from {A,C,T,G} (nucleotides)
•  If a site (position) changes on an edge, it changes with equal probability to each of the remaining states.
•  The evolutionary process is Markovian.

The different sites are assumed to evolve independently and identically down the tree (with rates that are drawn from a gamma distribution).

More complex models (such as the General Markov model) are also considered, often with little change to the theory.
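The Jukes-Cantor process described above can be simulated in a few lines. This is a minimal sketch under stated assumptions: the nested-tuple tree encoding `(p, payload)`, the function names, and the example edge probabilities are all illustrative, and rate variation across sites (the gamma distribution) is left out.

```python
import random

NUCLEOTIDES = "ACGT"


def evolve_along_edge(seq, p, rng):
    """Copy seq down one edge; each site changes with probability p,
    and a changed site moves to one of the 3 other states uniformly."""
    out = []
    for state in seq:
        if rng.random() < p:
            out.append(rng.choice([x for x in NUCLEOTIDES if x != state]))
        else:
            out.append(state)
    return "".join(out)


def simulate(node, parent_seq, rng, leaves=None):
    """node = (p, payload): p is the substitution probability on the edge
    above the node; payload is a leaf name (str) or a list of child nodes.
    Returns a dict mapping leaf names to simulated sequences."""
    if leaves is None:
        leaves = {}
    p, payload = node
    seq = evolve_along_edge(parent_seq, p, rng)
    if isinstance(payload, str):
        leaves[payload] = seq
    else:
        for child in payload:
            simulate(child, seq, rng, leaves)
    return leaves


if __name__ == "__main__":
    rng = random.Random(0)
    root_seq = "".join(rng.choice(NUCLEOTIDES) for _ in range(20))
    # Hypothetical model tree ((U,V),W) with made-up edge probabilities.
    tree = (0.0, [(0.1, [(0.05, "U"), (0.05, "V")]), (0.2, "W")])
    for name, seq in simulate(tree, root_seq, rng).items():
        print(name, seq)
```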

Questions

•  Is the model tree identifiable?

•  Which estimation methods are statistically consistent under this model?

•  How much data does the method need to estimate the model tree correctly (with high probability), and how well do the methods perform in practice?

Statistical Consistency

[Figure: error versus amount of data.]

Answers?

•  We know a lot about which site evolution models are identifiable, and which methods are statistically consistent.

•  We know a little bit about the sequence length requirements for standard methods.

•  Extensive studies show that even the best methods produce gene trees with some error.

Statistically consistent phylogenetic reconstruction methods

1. Hill-climbing heuristics for NP-hard optimization criteria (e.g., Maximum Likelihood)
2. Polynomial time distance-based methods: Neighbor Joining, FastME, etc.
3. Bayesian methods

[Figure: cost plotted over the space of phylogenetic trees, with a local optimum and the global optimum marked.]

Distance-based estimation

Quantifying Error

FN: false negative (missing edge)
FP: false positive (incorrect edge)
50% error rate
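The FN and FP rates can be computed from the internal edges (bipartitions) of the two trees. This is a minimal sketch, assuming each tree is supplied as a list of its non-trivial bipartitions, each given by the leaves on one side; the helper names and the example trees are illustrative.

```python
def canonical(split, leaves):
    """Represent a bipartition by the side containing the smallest leaf,
    so that A|B and B|A compare equal."""
    side = frozenset(split)
    return side if min(leaves) in side else frozenset(leaves) - side


def error_rates(true_splits, est_splits, leaves):
    """Return (FN rate, FP rate).
    FN: fraction of true-tree edges missing from the estimated tree.
    FP: fraction of estimated-tree edges absent from the true tree."""
    leaves = frozenset(leaves)
    true_set = {canonical(s, leaves) for s in true_splits}
    est_set = {canonical(s, leaves) for s in est_splits}
    fn = len(true_set - est_set) / len(true_set)
    fp = len(est_set - true_set) / len(est_set)
    return fn, fp


if __name__ == "__main__":
    leaves = "abcde"
    true_tree = [{"a", "b"}, {"a", "b", "c"}]   # ((a,b),c,(d,e))
    est_tree = [{"a", "b"}, {"c", "e"}]         # ((a,b),d,(c,e))
    print(error_rates(true_tree, est_tree, leaves))
```

When both trees are fully resolved they have the same number of internal edges, so the FN and FP rates coincide.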

Neighbor Joining (NJ) on large trees

Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides.

Error rates reflect proportion of incorrect edges in inferred trees.

[Nakhleh et al. ISMB 2001]

[Figure: NJ error rate (0 to 0.8) versus number of taxa (0 to 1600).]

In other words…

[Figure: error versus amount of data.]

Statistical consistency doesn’t guarantee accuracy w.h.p. unless the sequences are long enough.

“Boosting” phylogeny reconstruction methods

•  DCMs “boost” the performance of phylogeny reconstruction methods.

DCM    Base method M    DCM-M

Divide-and-conquer for phylogeny estimation


Sequence length requirements

The sequence length (number of sites) that a phylogeny reconstruction method M needs to reconstruct the true tree with probability at least 1-ε depends on

•  M (the method)
•  ε
•  f = min p(e),
•  g = max p(e), and
•  n, the number of leaves

We fix everything but n.

Statistical consistency, exponential convergence, and absolute fast convergence (afc)

Distance-based estimation

Neighbor Joining’s sequence length requirement is exponential!

•  Atteson 1999: Let T be a Jukes-Cantor model tree defining additive matrix D. Then Neighbor Joining will reconstruct the true tree with high probability from sequences that are of length $O(\lg n \, e^{\max D_{ij}})$.

•  Lacey and Chang 2009: Matching lower bound


Refinement Step

Supertree Step

Construct subset trees

DCM1 Decompositions

DCM1 decomposition: Compute maximal cliques

Input: Set S of sequences, distance matrix d, threshold value $q \in \{d_{ij}\}$

1. Compute threshold graph $G_q = (V, E)$, $V = S$, $E = \{(i,j) : d(i,j) \le q\}$

2. Perform minimum weight triangulation (note: if d is an additive matrix, then the threshold graph is provably triangulated).
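The threshold graph and its maximal cliques can be sketched as follows. This is an illustration under stated assumptions, not the DCM1 implementation: the triangulation step is skipped (the example matrix is additive, so the threshold graph is already triangulated), maximal cliques are enumerated with a plain Bron-Kerbosch recursion chosen here for brevity, and all names are hypothetical.

```python
def threshold_graph(taxa, d, q):
    """Adjacency sets of G_q: an edge joins i and j whenever d[i][j] <= q."""
    adj = {t: set() for t in taxa}
    for idx, a in enumerate(taxa):
        for b in taxa[idx + 1:]:
            if d[a][b] <= q:
                adj[a].add(b)
                adj[b].add(a)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}   # v has been fully explored
            x = x | {v}   # exclude v from later cliques at this level

    expand(set(), set(adj), set())
    return cliques


if __name__ == "__main__":
    # Toy additive distances: four taxa at positions 0, 1, 2, 3 on a path.
    pos = {"a": 0, "b": 1, "c": 2, "d": 3}
    taxa = list(pos)
    d = {i: {j: abs(pos[i] - pos[j]) for j in taxa} for i in taxa}
    for q in (1, 2):
        subsets = maximal_cliques(threshold_graph(taxa, d, q))
        print(q, sorted(sorted(c) for c in subsets))
```

Each maximal clique is one subset of taxa on which the base method would be run; larger thresholds give fewer, larger, more overlapping subsets.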

DCM1-boosting: Warnow, St. John, and Moret, SODA 2001

•  The DCM1 phase produces a collection of trees (one for each threshold), and the SQS phase picks the “best” tree.

•  For a given threshold, the base method is used to construct trees on small subsets (defined by the threshold) of the taxa. These small trees are then combined into a tree on the full set of taxa.

DCM1 SQS

Exponentially converging (base) method

Absolute fast converging (DCM1-boosted) method

Neighbor Joining on large diameter trees

Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides.

Error rates reflect proportion of incorrect edges in inferred trees.

[Nakhleh et al. ISMB 2001]

[Figure: NJ error rate (0 to 0.8) versus number of taxa (0 to 1600).]

DCM1-boosting distance-based methods [Nakhleh et al. ISMB 2001]

•  Theorem (Warnow et al., SODA 2001): DCM1-NJ converges to the true tree from polynomial length sequences

[Figure: error rate (0 to 0.8) versus number of taxa (0 to 1600) for NJ and DCM1-NJ.]

DCM1-boosting

•  DCM1-boosting: reducing sequence length requirements for gene tree accuracy from exponential to polynomial

•  Key algorithmic ingredients:
– Construct triangulated graph
– Apply tree construction methods to subsets
– Combine subset trees together using supertree method
– Select best tree from a set of trees

Other applications of divide-and-conquer

•  DACTAL-boosting:
– almost alignment-free tree estimation
– Improving multi-locus species tree estimation

•  DCM2-boosting: improving heuristic searches for maximum likelihood

•  Superfine-boosting: improving supertree methods


Refinement Step

Supertree Step

Construct subset trees

Supertree Problems

•  Quartet Median Supertree
•  Robinson-Foulds Supertree
•  Matrix Representation with Parsimony
•  Matrix Representation with Likelihood
•  Etc.

All are NP-hard because even testing compatibility of unrooted trees is NP-complete.

Phylogenomics

gene 1, gene 2, . . . , gene 999, gene 1000

ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG

CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G

AGCAGCATCGTG AGCAGC-TCGTG AGCAGC-TC-TG C-TA-CACGGTG

CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT

“gene” here refers to a portion of the genome (not a functional gene)

Orangutan

Gorilla

Chimpanzee

Human

I’ll use the term “gene” to refer to “c-genes”: recombination-free orthologous stretches of the genome

Gene tree discordance

[Figure: the trees for gene 1 and gene 1000 on Orang., Gorilla, Chimp, and Human differ.]

Incomplete Lineage Sorting (ILS) is a dominant cause of gene tree heterogeneity

Gene trees inside the species tree (Coalescent Process)

Present

Past

Courtesy James Degnan

Gorilla and Orangutan are not siblings in the species tree, but they are in the gene tree.

Incomplete Lineage Sorting (ILS)

•  Confounds phylogenetic analysis for many groups: Hominids, Birds, Yeast, Animals, Toads, Fish, Fungi, etc.

•  There is substantial debate about how to analyze phylogenomic datasets in the presence of ILS, focused around statistical consistency guarantees (theory) and performance on data.

Anomaly zone

•  An anomalous gene tree (AGT) is one that is more probable than the true species tree under the multi-species coalescent model.

•  Theorem (Hudson 1983): There are no rooted 3-leaf AGTs.

•  Theorem (Allman et al. 2011, Degnan 2013): There are no unrooted 4-leaf AGTs.

•  Theorem (Degnan 2013, Rosenberg 2013): For n>3, there are model species trees with rooted AGTs, and for n>4 there are model species trees with unrooted AGTs.

ASTRAL [Mirarab, et al., ECCB/Bioinformatics, 2014]

• Optimization Problem (NP-Hard): Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees

$$\mathrm{Score}(T) = \sum_{t \in \mathcal{T}} |Q(T) \cap Q(t)|$$

where $Q(T)$ is the set of quartet trees induced by $T$, $t$ is a gene tree, and $\mathcal{T}$ is the set of all input gene trees.

• Theorem: Statistically consistent under the multi-species coalescent model when solved exactly
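The quartet score can be evaluated by brute force on small examples. The sketch below assumes each tree is given as a list of its non-trivial bipartitions (one side of each) over a common leaf set; it enumerates quartets explicitly and is only meant to illustrate the criterion, not how ASTRAL computes it. All names are illustrative.

```python
from itertools import combinations


def induced_quartets(leaves, splits):
    """Set Q(T) of resolved quartet trees ab|cd induced by a tree,
    where the tree is given by its non-trivial bipartitions.
    A quartet ab|cd is induced iff some edge separates {a,b} from {c,d}."""
    leaves = frozenset(leaves)
    quartets = set()
    for side in splits:
        side = frozenset(side)
        other = leaves - side
        for pair1 in combinations(sorted(side), 2):
            for pair2 in combinations(sorted(other), 2):
                quartets.add(frozenset([frozenset(pair1), frozenset(pair2)]))
    return quartets


def quartet_score(leaves, species_splits, gene_trees):
    """Sum over gene trees t of |Q(T) intersect Q(t)|."""
    q_species = induced_quartets(leaves, species_splits)
    return sum(len(q_species & induced_quartets(leaves, g)) for g in gene_trees)


if __name__ == "__main__":
    leaves = "abcde"
    species = [{"a", "b"}, {"a", "b", "c"}]   # ((a,b),c,(d,e))
    gene1 = [{"a", "b"}, {"a", "b", "c"}]     # same topology
    gene2 = [{"a", "c"}, {"a", "b", "c"}]     # ((a,c),b,(d,e))
    print(quartet_score(leaves, species, [gene1, gene2]))
```

In this toy input the first gene tree shares all 5 quartets with the candidate species tree and the second shares 3, for a score of 8.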

Constrained Maximum Quartet Support Tree

•  Input: Set $\mathcal{T}$ = {t1, t2, …, tk} of unrooted gene trees, with each tree on set S with n species, and set X of allowed bipartitions

•  Output: Unrooted tree T on leaf set S, maximizing the total quartet tree similarity to $\mathcal{T}$, subject to T drawing its bipartitions from X.

Theorems:
•  Mirarab et al. 2014: If X contains the bipartitions from the input gene trees (and perhaps others), then an exact solution to this problem is statistically consistent under the MSC.

•  Mirarab and Warnow 2015: The constrained MQST problem can be solved in $O(|X|^2 nk)$ time. (We use dynamic programming, and build the unrooted tree from the bottom-up, based on “allowed clades” – halves of the allowed bipartitions.)
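The bottom-up dynamic programming over allowed clades can be sketched generically. The code below is a simplified rooted analogue, not ASTRAL's algorithm: it assumes an additive per-clade weight (here, a made-up support count) in place of the quartet-based weighting, and the function and variable names are illustrative.

```python
from functools import lru_cache


def best_constrained_tree(taxa, allowed_clades, weight):
    """Return (score, tree) for the rooted binary tree whose clades all
    come from allowed_clades (plus singletons and the full taxon set)
    and whose summed clade weight is maximal. Trees are nested tuples."""
    taxa = frozenset(taxa)
    allowed = {frozenset(c) for c in allowed_clades}
    allowed |= {frozenset([x]) for x in taxa} | {taxa}

    @lru_cache(maxsize=None)
    def solve(clade):
        if len(clade) == 1:
            return weight(clade), next(iter(clade))
        best = None
        for left in allowed:
            if left < clade:                      # proper subset of clade
                right = clade - left
                # min() comparison visits each unordered pair only once
                if right in allowed and min(left) < min(right):
                    ls, lt = solve(left)
                    rs, rt = solve(right)
                    if best is None or ls + rs > best[0]:
                        best = (ls + rs, (lt, rt))
        if best is None:                          # clade cannot be split within X
            return float("-inf"), None
        return weight(clade) + best[0], best[1]

    return solve(taxa)


if __name__ == "__main__":
    # Hypothetical clade support counts, for illustration only.
    support = {frozenset("ab"): 3, frozenset("cd"): 2,
               frozenset("abc"): 1, frozenset("bc"): 1}
    score, tree = best_constrained_tree(
        "abcd", support.keys(), lambda c: support.get(c, 0))
    print(score, tree)
```

Each allowed clade is solved once and tries every allowed sub-clade as one half of a split, which is where the quadratic dependence on |X| comes from in this sketch.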

Simulation study

• Variable parameters:

    • Number of species: 10 – 1000

    • Number of genes: 50 – 1000

    • Amount of ILS: low, medium, high

    • Deep versus recent speciation

• 11 model conditions (50 replicates each) with heterogeneous gene tree error

• Compare to NJst, MP-EST, concatenation (CA-ML)

• Evaluate accuracy using FN rate: the percentage of branches in the true tree that are missing from the estimated tree

True gene trees    Sequence data

Estimated species tree

Finch Falcon Owl Eagle Pigeon

Estimated gene trees

Finch Owl Falcon Eagle Pigeon

True (model) species tree

ASTRAL-II

look at all pairs of leaves chosen each from one of the children of u. For each such pair of leaves, there are $\binom{u'}{2}$ quartet trees that put that pair together, where u' is the number of leaves outside the node u. This will examine each pair of nodes in each of the input k nodes exactly once and would therefore require $O(n^2 k)$ computations. The final score can be normalized by the maximum number of input quartet trees that include a pair of taxa.

Given the similarity matrix, we calculate an UPGMA tree and add all its bipartitions to the set X. This heuristic adds relatively few bipartitions, but the matrix is used in the next heuristic, which is our main addition mechanism.

Greedy: We estimate the greedy consensus of the gene trees at different threshold levels (0, 1/100, 2/100, 5/100, 1/10, 1/4, 1/3). For each polytomy in each greedy consensus tree, we resolve the polytomy in multiple ways and add bipartitions implied by those resolutions to the set X. First, we resolve the polytomy by applying the UPGMA algorithm to the similarity matrix, starting from the clades given by the polytomy. Then, we sample one taxon from each side of the polytomy randomly, and use the greedy consensus of the gene trees restricted to this subsample to find a resolution of the polytomy (we randomly resolve any multifurcations in this greedy consensus on the induced subsample). We repeat this process at least 10 times, but if the subsampled greedy consensus trees include sufficiently frequent bipartitions (defined as > 1%), we do more rounds of random sampling (we increase the number of iterations by two every time this happens). For each random subsample around a polytomy, we also resolve it by calculating an UPGMA tree on the subsampled similarity matrix. Finally, for the two first greedy threshold values and the first 10 random subsamples, we also use a third strategy that can potentially add a larger number of bipartitions. For each subsampled taxon x, we resolve the polytomy as a caterpillar tree by sorting the remaining taxa according to their similarity with x.

Gene tree polytomies: When gene trees include polytomies, we also add new bipartitions to set X. We first compute the greedy consensus of the input gene trees with threshold 0 and if the greedy consensus has polytomies, we resolve them using UPGMA; we repeat this process twice to account for uncertainty in greedy consensus estimation. Then, for each gene tree polytomy, we use the two resolved consensus trees to infer a resolution of the polytomy and we add the implied resolutions to set X.

3.3 Multi-furcating input gene trees

Extending ASTRAL to inputs that include polytomies requires solving the weighted quartet tree problem when each node of the input defines not a tripartition, but a multi-partition of the set of taxa. We start by a basic observation: every resolved quartet tree induced by a gene tree maps to two nodes in the gene tree regardless of whether the gene tree is binary or not. In other words, induced quartet trees that map to only one node of the gene tree are unresolved. When maximizing the quartet support, these unresolved gene tree quartet trees are inconsequential and need to be ignored. Now, consider a polytomy of degree d. There are $\binom{d}{3}$ ways to select three sides of the polytomy. Each of these ways of selecting three sides defines a tripartition of a subset of taxa. Any selection of two taxa from one side of this tripartition and one taxon from each of the remaining two sides still defines an induced resolved quartet tree, and each induced resolved quartet tree would still map to exactly two nodes in our multi-furcating tree. Thus, all the algorithmic assumptions of ASTRAL remain intact, as long as for each multi-furcating node in an input gene tree, we treat it as a collection of $\binom{d}{3}$ tripartitions. Note that in the presence of polytomies, the running time analysis can change because analyzing each multi-furcating node requires time cubic in its degree and the degree can increase in principle with n. Thus, the running time depends on the patterns of the multi-furcations and cannot be studied in a general case.

[Figure panels: (a) True gene tree discordance: density of RF distance (true species tree vs true gene trees), by rate (1e−06, 1e−07) and tree height (10M, 2M, 500K). (b) Gene tree estimation error: density of RF distance (true vs estimated).]

Fig. 1. Characteristics of the simulation (a) RF distance between the true species tree and the true gene trees (50 replicates of 1000 genes) for Dataset I. Tree height directly affects the amount of true discordance; the speciation rate affects true gene tree discordance only with 10M tree length. (b) RF distance between true gene trees and estimated gene trees for Dataset I. See also Figure S1 for inter and intra-replicate gene tree error distributions.
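The treatment of a degree-d polytomy as a collection of tripartitions can be sketched with `itertools.combinations`. This is a minimal illustration, assuming the polytomy is given as a list of its d sides (each a set of taxa); the function name is hypothetical.

```python
from itertools import combinations
from math import comb


def polytomy_tripartitions(sides):
    """List every way of choosing three sides of a polytomy; each choice
    is a tripartition of a subset of the taxa."""
    return [tuple(frozenset(s) for s in triple)
            for triple in combinations(sides, 3)]


if __name__ == "__main__":
    # A degree-4 polytomy with sides {a}, {b}, {c,d}, {e,f}.
    sides = [{"a"}, {"b"}, {"c", "d"}, {"e", "f"}]
    tripartitions = polytomy_tripartitions(sides)
    print(len(tripartitions), comb(len(sides), 3))
    for tri in tripartitions:
        print([sorted(side) for side in tri])
```

Enumerating all triples is what makes the per-node cost cubic in the degree d.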

Statistical Consistency: ASTRAL-I was statistically consistent, and changes from ASTRAL-I to ASTRAL-II either affect running time, or enlarge the search space, which does not negate consistency.

Theorem 3: ASTRAL-II is statistically consistent for binary complete input gene trees.

4 EXPERIMENTAL SETUP

Simulation Procedure: We used SimPhy, a tool developed by Mallo et al. (2015), to simulate species trees and gene trees (produced in mutation units), and then used Indelible to simulate sequences down the gene trees with varying length and model parameters. We estimated gene trees on these simulated gene alignments, which we then used in coalescent-based analyses.

We simulated 10 model conditions, which we divide into two datasets, with one model condition appearing in both datasets. We

Used SimPhy, Mallo and Posada, 2015

Tree accuracy when varying the number of species

1000 genes, “medium” levels of recent ILS

[Figure: species tree topological error (FN), 4% to 16%, versus number of species (10, 50, 100, 200, 500, 1000), for ASTRAL-II and MP-EST.]

Other polynomial time algorithms for constrained optimization problems

•  Minimize Duplication/Loss Supertree (Hallett and Lagergren 2000)

•  Quartet Support (Bryant and Steel 2001)

•  Constrained Minimize Deep Coalescence (Than and Nakhleh 2000, Yu, Warnow, and Nakhleh 2011).

•  Gene tree estimation under Duplication and Loss (Szöllősi et al. 2013)

•  Constrained Robinson-Foulds Supertree (Vachaspati and Warnow 2016)

Summary

•  NP-hard optimization problems abound in phylogeny reconstruction, and in computational biology in general, and need very accurate solutions.

•  Divide-and-conquer techniques can provide greatly improved accuracy and scalability, and excellent statistical performance guarantees. But divide-and-conquer methods depend on having good supertree methods.

•  Constrained exact optimization has been surprisingly beneficial for phylogenomic analysis.

Open Problems

•  What other optimization problems are tractable, given constraint sets?

•  How should we define the set X of constraints from a set of source trees, especially when the source trees are incomplete?

•  Are there other ways of constraining the search space that are effective and useful?

•  What are other effective ways of approaching truly large-scale phylogeny estimation?

•  Can divide-and-conquer be employed with Bayesian methods? (Or, more generally, how can we make Bayesian methods more scalable?)

Acknowledgments

NSF grant DBI-1461364 (joint with Noah Rosenberg at Stanford and Luay Nakhleh at Rice): http://tandy.cs.illinois.edu/PhylogenomicsProject.html
Papers available at http://tandy.cs.illinois.edu/papers.html
ASTRAL: Available at https://github.com/smirarab
FastRFS: Available at https://github.com/pranjalv123/FastRFS

Constrained Robinson-Foulds Supertree

•  Input: set $\mathcal{T}$ of source trees, and set X of allowed bipartitions

•  Output: tree T that minimizes the total Robinson-Foulds distance to the input source trees, and that draws its bipartitions from X.

Theorem (Vachaspati and Warnow 2016): The Constrained RF Supertree problem can be solved in $O(|X|^2 nk)$ time, where there are k source trees and n species.
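The objective being minimized, the total Robinson-Foulds distance from a candidate supertree to the source trees, can be sketched as follows. It assumes trees are given as lists of non-trivial bipartitions and that the supertree is restricted to each source tree's leaf set before comparison; the helper names are illustrative, and this only evaluates the score of a given tree rather than searching for the optimum.

```python
def canonical(split, leaves):
    """Restrict a bipartition to `leaves`; return the side containing the
    smallest leaf, or None if the restricted bipartition is trivial."""
    side = frozenset(split) & leaves
    other = leaves - side
    if len(side) < 2 or len(other) < 2:
        return None
    return side if min(leaves) in side else other


def restricted_splits(splits, leaves):
    """Non-trivial bipartitions of a tree after restriction to `leaves`."""
    found = (canonical(s, leaves) for s in splits)
    return {c for c in found if c is not None}


def rf_distance(supertree_splits, source_splits, source_leaves):
    """Robinson-Foulds distance between a source tree and the supertree
    restricted to the source tree's leaf set (symmetric difference)."""
    leaves = frozenset(source_leaves)
    a = restricted_splits(supertree_splits, leaves)
    b = restricted_splits(source_splits, leaves)
    return len(a ^ b)


def total_rf(supertree_splits, sources):
    """Sum of RF distances over (splits, leaves) source-tree pairs."""
    return sum(rf_distance(supertree_splits, s, l) for s, l in sources)


if __name__ == "__main__":
    supertree = [{"a", "b"}, {"a", "b", "c"}]            # on a..e
    sources = [([{"a", "b"}], "abcd"),                   # agrees
               ([{"a", "c"}, {"a", "b", "c"}], "abcde")]  # one conflict
    print(total_rf(supertree, sources))
```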


…ACGGTGCAGTTACC-A…

…AC----CAGTCACCTA…

The true multiple alignment
–  Reflects historical substitution, insertion, and deletion events
–  Defined using transitive closure of pairwise alignments computed on edges of the true tree

…ACGGTGCAGTTACCA…

Substitution Deletion

…ACCAGTCACCTA…

Insertion

1KP: Thousand Transcriptome Project

•  103 plant transcriptomes, 400-800 single copy “genes”

•  Wickett, Mirarab et al., PNAS 2014

•  Next phase will be much bigger

    •  ~1000 species and ~1000 genes,

    •  and will require the inference of multiple sequence alignments and trees on more than 100,000 sequences




N. Matasci (iPlant)

T. Warnow (UIUC), S. Mirarab (UCSD), N. Nguyen (UCSD)

And many others

Constrained optimization

•  First proposed in Hallett and Lagergren 2000, for the duplication-loss species tree problem

•  Algorithms for constrained optimization use dynamic programming to find optimal solutions, constructing the best tree from the “bottom-up”

•  The set X is usually defined to be the bipartitions from the source trees.
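The bottom-up dynamic programming idea can be sketched in a simplified setting. The sketch below is an assumption-laden illustration, not the algorithm of any particular tool: it works with rooted clades rather than unrooted bipartitions, it takes an arbitrary additive `weight` function supplied by the caller, and the function name `best_constrained_tree` is hypothetical. For each allowed clade it tries every way of splitting that clade into two allowed clades and keeps the best-scoring split.

```python
from functools import lru_cache


def best_constrained_tree(taxa, X, weight):
    """Best rooted binary tree on `taxa` whose clades all come from X.

    Returns (score, clades); score is -inf if no such tree exists.
    Singleton clades and the full taxon set are always allowed.
    """
    taxa = frozenset(taxa)
    allowed = {frozenset(c) for c in X if c}
    allowed |= {frozenset([t]) for t in taxa} | {taxa}

    @lru_cache(maxsize=None)
    def solve(clade):
        if len(clade) == 1:
            return weight(clade), (clade,)
        best_score, best_clades = float("-inf"), ()
        for left in allowed:
            right = clade - left
            # left must be a proper subset, and its complement must be allowed
            if left < clade and right in allowed:
                ls, lc = solve(left)
                rs, rc = solve(right)
                if ls + rs > best_score:
                    best_score, best_clades = ls + rs, lc + rc
        if not best_clades:
            return float("-inf"), ()
        return best_score + weight(clade), best_clades + (clade,)

    return solve(taxa)


# Hypothetical weights: how many source trees contain each clade.
counts = {frozenset("ab"): 3, frozenset("cd"): 1, frozenset("abc"): 2}
score, clades = best_constrained_tree("abcd", counts, lambda c: counts.get(c, 0))
print(score, sorted("".join(sorted(c)) for c in clades))
```

Each clade is solved once and each solve scans the allowed set once, so the work grows roughly quadratically in |X|, which is the flavor of the running times quoted for constrained methods.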

DACTAL

[Figure: DACTAL pipeline. Box labels: Unaligned Sequences; BLAST-based; Overlapping subsets; Existing Method: RAxML(MAFFT); A tree for each subset; New supertree method: SuperFine; A tree for the entire dataset; pRecDCM3.]

Results on three biological datasets (6000 to 28,000 sequences). We show results with 5 DACTAL iterations.

DACTAL

[Figure: DACTAL loop. Box labels: Arbitrary input; Overlapping subsets; Construct subset trees; A tree for each subset; Construct supertree; A tree for the entire dataset; Decompose using tree.]

Triangulated Graphs

•  Definition: A graph is triangulated (chordal) if it has no induced simple cycles of size four or more, i.e., every cycle of length four or more has a chord.
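One way to test whether a graph is triangulated (also called chordal) is to repeatedly delete a simplicial vertex, meaning a vertex whose neighbors are pairwise adjacent. A graph is triangulated exactly when this process can delete every vertex. The sketch below is a plain, unoptimized version of that check; the adjacency-dictionary input format and the name `is_triangulated` are choices made for this example.

```python
from itertools import combinations


def is_triangulated(adj):
    """adj maps each vertex to the set of its neighbors (undirected graph)."""
    g = {v: set(ns) for v, ns in adj.items()}
    while g:
        # Look for a vertex whose neighborhood is a clique.
        simplicial = next(
            (v for v, ns in g.items()
             if all(b in g[a] for a, b in combinations(ns, 2))),
            None,
        )
        if simplicial is None:
            return False  # some chordless cycle of length >= 4 remains
        for u in g.pop(simplicial):
            g[u].discard(simplicial)
    return True


four_cycle = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
with_chord = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"a", "c"}}
print(is_triangulated(four_cycle), is_triangulated(with_chord))
```

The four-cycle fails the test, while adding the chord a-c makes it pass.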



Sampling multiple genes from multiple species

Standard heuristic search

[Figure: search through tree space between trees T and T’. Labels: Hill-climbing; Random perturbation.]

Problems with current techniques for MP

[Figure: Performance of TNT with time. x-axis: Hours (0 to 24); y-axis: Average MP score above optimal, shown as a percentage of the optimal (0 to 0.2).]

Shown here is the performance of the TNT software for maximum parsimony on a real dataset of almost 14,000 sequences. The required level of accuracy with respect to MP score is no more than 0.01% error (otherwise high topological error results). (“Optimal” here means best score to date, using any method for any amount of time.)

Solving NP-hard problems exactly is … unlikely

•  Number of (unrooted) binary trees on n leaves is (2n-5)!!

•  If each tree on 1000 taxa could be analyzed in 0.001 seconds, we would find the best tree in 2890 millennia

#leaves   #trees
4         3
5         15
6         105
7         945
8         10395
9         135135
10        2027025
20        2.2 x 10^20
100       4.5 x 10^190
1000      2.7 x 10^2900
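The count (2n-5)!! is the product of the odd numbers 1, 3, 5, ..., 2n-5, so it is easy to compute exactly with integer arithmetic. A minimal sketch (the function name is chosen for this example):

```python
def num_unrooted_binary_trees(n):
    """(2n-5)!! = 1 * 3 * 5 * ... * (2n-5), for n >= 3 leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for odd in range(3, 2 * n - 4, 2):
        count *= odd
    return count


for n in (4, 5, 6, 7, 8, 9, 10):
    print(n, num_unrooted_binary_trees(n))
print(20, f"{num_unrooted_binary_trees(20):.1e}")
```

The exact values for 4 to 10 leaves agree with the small entries in the table, and the growth is already past 10^20 at 20 leaves.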

Gene tree discordance

[Figure: gene trees for gene 1 and gene 1000 on Orang., Gorilla, Chimp, Human.]

Avian Phylogenomics Project

E Jarvis, HHMI

G Zhang, BGI

MTP Gilbert, Copenhagen

S. Mirarab, UT-Austin

Md. S. Bayzid, UT-Austin

T. Warnow, UT-Austin

Plus many many other people…

•  Approx. 50 species, whole genomes, 14,000 loci

Challenges:

•  Concatenation analysis used >200 CPU years

But also

•  Massive gene tree conflict suggestive of ILS, and most gene trees had very low bootstrap support

•  Coalescent-based analysis using MP-EST produced tree that conflicted with concatenation analysis

Big datasets and hard problems

•  Computationally intensive problems:

    –  Multiple sequence alignment

    –  Maximum likelihood gene tree estimation

    –  Species tree estimation from multiple gene trees

•  Fast methods are generally not sufficiently accurate

•  Accurate methods are generally computationally intensive

Quantifying Error

FN: false negative (missing edge)

FP: false positive (incorrect edge)

[Figure: example with edges labeled FN and FP, and a 50% error rate.]
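The two error measures can be written down directly once each tree is reduced to its set of internal edges (bipartitions). The sketch below assumes both trees are on the same leaf set and each has at least one internal edge; the names `bip` and `fn_fp_rates` and the five-leaf example are made up for illustration and are not the trees drawn on the slide.

```python
def bip(side, leaves):
    """Bipartition side | (leaves - side), stored as an unordered pair."""
    side, leaves = frozenset(side), frozenset(leaves)
    return frozenset({side, leaves - side})


def fn_fp_rates(true_bps, est_bps):
    """FN rate: fraction of true edges missing from the estimate.
    FP rate: fraction of estimated edges absent from the true tree."""
    true_bps, est_bps = set(true_bps), set(est_bps)
    fn = len(true_bps - est_bps) / len(true_bps)
    fp = len(est_bps - true_bps) / len(est_bps)
    return fn, fp


leaves = "abcde"
true_tree = {bip("ab", leaves), bip("abc", leaves)}
estimated = {bip("ab", leaves), bip("abd", leaves)}
print(fn_fp_rates(true_tree, estimated))
```

Here one of the two true edges is missed and one of the two estimated edges is wrong, so both rates are 50%.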

Neighbor joining has poor performance on large diameter trees [Nakhleh et al. ISMB 2001]

Theorem (Atteson): Exponential sequence length requirement for Neighbor Joining!

[Figure: NJ error rate as the number of taxa grows. x-axis: No. Taxa (0 to 1600); y-axis: Error Rate (0 to 0.8).]



Species Tree Estimation requires multiple genes!



Species tree estimation: difficult, even for small datasets!

Major Challenges: large datasets, fragmentary sequences

•  Multiple sequence alignment: Few methods can run on large datasets, and alignment accuracy is generally poor for large datasets with high rates of evolution.

•  Gene Tree Estimation: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements).

•  Species Tree Estimation: gene tree incongruence makes accurate estimation of species tree challenging.

•  Phylogenetic Network Estimation: Horizontal gene transfer and hybridization require non-tree models of evolution

Both phylogenetic estimation and multiple sequence alignment are also impacted by fragmentary data.

Strict Consensus Merger (SCM)

[Figure: Strict Consensus Merger of two trees, one on leaves a, b, c, d, e, f, g and one on leaves a, b, c, d, h, i, j, which overlap on a, b, c, d.]

The Tree of Life: Multiple Challenges

•  Large-scale statistical phylogeny estimation

•  Ultra-large multiple-sequence alignment

•  Estimating species trees from incongruent gene trees

•  Supertree estimation

•  Genome rearrangement phylogeny

•  Reticulate evolution

•  Visualization of large trees and alignments

•  Data mining techniques to explore multiple optima

The Tree of Life: Multiple Challenges

Large datasets: 100,000+ sequences, 10,000+ genes, “Big Data” complexity

Scientific challenges:

•  Ultra-large multiple-sequence alignment

•  Alignment-free phylogeny estimation

•  Supertree estimation

•  Estimating species trees from many gene trees

•  Genome rearrangement phylogeny

•  Reticulate evolution

•  Visualization of large trees and alignments

•  Data mining techniques to explore multiple optima

•  Theoretical guarantees under Markov models of evolution

Applications:

•  metagenomics

•  protein structure and function prediction

•  trait evolution

•  detection of co-evolution

•  systems biology

Techniques:

•  Graph theory (especially chordal graphs)

•  Probability theory and statistics

•  Hidden Markov models

•  Combinatorial optimization

•  Heuristics

•  Supercomputing

1kp: Thousand Transcriptome Project

•  Plant Tree of Life based on transcriptomes of ~1200 species

•  More than 13,000 gene families (most not single copy)

•  First paper: PNAS 2014 (~100 species and ~800 loci)

•  Gene Tree Incongruence




N. Matasci (iPlant)

T. Warnow (UIUC), S. Mirarab (UT-Austin), N. Nguyen (UT-Austin)

Plus many many other people…

Upcoming Challenges (~1200 species, ~400 loci):

•  Species tree estimation under the multi-species coalescent from hundreds of conflicting gene trees on >1000 species (we will use ASTRAL: Mirarab et al. 2014, Mirarab & Warnow 2015)

•  Multiple sequence alignment of >100,000 sequences (with lots of fragments!); we will use UPP (Nguyen et al., 2015)

phylogenomics

[Figure: alignments for four genes, labeled gene 1, gene 2, gene 999, and gene 1000.]

ACTGCACACCG
ACTGC-CCCCG
AATGC-CCCCG
-CTGCACACGG

CTGAGCATCG
CTGAGC-TCG
ATGAGC-TC-
CTGA-CAC-G

AGCAGCATCGTG
AGCAGC-TCGTG
AGCAGC-TC-TG
C-TA-CACGGTG

CAGGCACGCACGAA
AGC-CACGC-CATA
ATGGCACGC-C-TA
AGCTAC-CACGGAT

“gene” here refers to a portion of the genome (not a functional gene)

Orangutan

Gorilla

Chimpanzee

Human

I’ll use the term “gene” to refer to “c-genes”: recombination-free orthologous stretches of the genome
