constrained exact op1mizaon in phylogenecs · constrained exact op1mizaon in phylogenecs tandy...
TRANSCRIPT
ConstrainedExactOp1miza1oninPhylogene1cs
TandyWarnowTheUniversityofIllinoisatUrbana-Champaign
Orangutan Gorilla Chimpanzee Human
From the Tree of the Life Website, University of Arizona
Phylogeny(evolu1onarytree)
DNA Sequence Evolution
AAGACTT
TGGACTT AAGGCCT
-3 mil yrs
-2 mil yrs
-1 mil yrs
today
AGGGCAT TAGCCCT AGCACTT
AAGGCCT TGGACTT
TAGCCCA TAGACTT AGCGCTT AGCACAA AGGGCAT
AGGGCAT TAGCCCT AGCACTT
AAGACTT
TGGACTT AAGGCCT
AGGGCAT TAGCCCT AGCACTT
AAGGCCT TGGACTT
AGCGCTT AGCACAA TAGACTT TAGCCCA AGGGCAT
Phylogeny Problem
TAGCCCA TAGACTT TGCACAA TGCGCTT AGTGCAT
U V W X Y
U
V W
X
Y
Phylogeny+genomics=genome-scalephylogenyes1ma1on.
. . .
Analyzeseparately
Summary Method
Maincompe1ngapproaches gene 1 gene 2 . . . gene k
. . . Concatenation
Species
Phylogenomicpipeline
• Selecttaxonsetandmarkers
• Gatherandscreensequencedata,possiblyiden1fyorthologs
• Computemul1plesequencealignmentand“genetree”foreachlocus
• Computespeciestreeorphylogene1cnetworkfromthegenetreesoralignments
• Getsta1s1calsupportoneachbranch(e.g.,bootstrapping)
• Es1matedatesonthenodesofthephylogeny
• Usespeciestreewithbranchsupportanddatestounderstandbiology
JustabouteverythingworthdoingisNP-hard
• Mul1plesequencealignment• Maximumlikelihoodgenetreees1ma1on• Speciestreees1ma1onbycombininggenetrees• Supertreees1ma1on
– andhencedivide-and-conquerstrategiesAndBayesianmethodsareevenmorecomputa1onallyintensive
JustabouteverythingworthdoingisNP-hard
• Mul1plesequencealignment• Maximumlikelihoodgenetreees1ma1on• Speciestreees1ma1onbycombininggenetrees• Supertreees1ma1on
– andhencedivide-and-conquerstrategiesAndBayesianmethodsareevenmorecomputa1onallyintensive
39 3.8 Branch-swapping methods
(a) Phylogenetic tree (b) Subtree prune. . . (c) and regraft
Figure 3.15 Subtree prune and regraft (SPR). (a) An unrooted phylogenetic tree T . The subtree consistingof three edges and the two leaves labeled a and b is pruned at the location marked “∗” toproduce the two subtrees shown in (b), and then reattached into the edge leading to theleaf labeled f (c).
(a) Tree bisection... (b) reconnection choice... (c) and reconnection
Figure 3.16 Tree bisection and reconnection (TBR). The phylogenetic tree T shown in (a) is bisectedinto two subtrees by deleting the edge marked “∗”. As illustrated in (b), one possible choiceis to reattach the two subtrees by a new edge joining the middles of the two leaf edgesassociated with taxa b and f . The resulting new phylogenetic tree is shown in (c).
The most general branch-swapping method considered here is called tree bisec-tion and reconnection (TBR). To apply this method to a phylogenetic tree, first bisectthe tree into two subtrees by completely removing an edge from the tree. Then, toreattach the tree, join the two subtrees by a new edge that leads from somewherein the one part to somewhere in the other, see Figure 3.16.
For a bifurcating phylogenetic tree T on X , let NNI(T), SPR(T) and TBR(T)denote the set of all phylogenetic trees that can be obtained by applying exactly oneNNI, SPR or TBR operation to T , respectively. It is not too difficult to see that thefollowing inclusions hold:
NNI(T) ⊆ SPR(T) ⊆ TBR(T). (3.9)
Let n denote the number of taxa. It can be shown that the cardinality of NNI(T)and SPR(T) depend only on n. While the size of NNI(T) grows linearly with n,the size of SPR(T) grows quadratically. The size of TBR(T) depends not only onn, but also on the actual topology of the phylogenetic trees considered.
Exercise 3.8.1 (Space of trees is connected) Show that the space of all bifurcatingphylogenetic trees (without edge lengths) on n taxa is connected in the sense that anyone such tree T1 can be transformed into a second one, T2, by a finite series of NNI(and thus also SPR or TBR) operations.
Standardheuris1csearch:TBRmovesthrough“treespace”Randomnesstoexitlocalop1ma
Butthereare(2n-5)!!treesonnleaves
Localsearchheuris1cs
FigurefromHuson2010
Solving maximum likelihood (and other hard optimization problems) is… unlikely
# of Taxa
# of Unrooted Trees
4 3 5 15 6 105 7 945 8 10395 9 135135 10 2027025 20 2.2 x 1020
100 4.5 x 10190
1000 2.7 x 102900
Avian Phylogenomics Project Erich Jarvis, HHMI
Guojie Zhang, BGI
• Approx. 50 species, whole genomes • 14,000 loci • Multi-national team (100+ investigators) • 8 papers published in special issue of Science 2014
Biggest computational challenges: 1. Multi-million site maximum likelihood analysis (~300 CPU years, and 1Tb of distributed memory, at supercomputers around world) 2. Constructing “coalescent-based” species tree from 14,000 different gene trees
MTP Gilbert, Copenhagen
Siavash Mirarab, Tandy Warnow, Texas Texas and UIUC
90. J. F. Storz, J. C. Opazo, F. G. Hoffmann, Mol. Phylogenet. Evol.66, 469–478 (2013).
91. F. G. Hoffmann, J. F. Storz, T. A. Gorr, J. C. Opazo, Mol. Biol.Evol. 27, 1126–1138 (2010).
ACKNOWLEDGMENTS
Genome assemblies and annotations of avian genomes in thisstudy are available on the avian phylogenomics website(http://phybirds.genomics.org.cn), GigaDB (http://dx.doi.org/10.5524/101000), National Center for Biotechnology Information(NCBI), and ENSEMBL (NCBI and Ensembl accession numbersare provided in table S2). The majority of this study wassupported by an internal funding from BGI. In addition, G.Z. wassupported by a Marie Curie International Incoming Fellowshipgrant (300837); M.T.P.G. was supported by a Danish NationalResearch Foundation grant (DNRF94) and a Lundbeck Foundationgrant (R52-A5062); C.L. and Q.L. were partially supported by aDanish Council for Independent Research Grant (10-081390);and E.D.J. was supported by the Howard Hughes Medical Instituteand NIH Directors Pioneer Award DP1OD000448.
The Avian Genome ConsortiumChen Ye,1 Shaoguang Liang,1 Zengli Yan,1 M. Lisandra Zepeda,2
Paula F. Campos,2 Amhed Missael Vargas Velazquez,2
José Alfredo Samaniego,2 María Avila-Arcos,2 Michael D. Martin,2
Ross Barnett,2 Angela M. Ribeiro,3 Claudio V. Mello,4 Peter V. Lovell,4
Daniela Almeida,3,5 Emanuel Maldonado,3 Joana Pereira,3
Kartik Sunagar,3,5 Siby Philip,3,5 Maria Gloria Dominguez-Bello,6
Michael Bunce,7 David Lambert,8 Robb T. Brumfield,9
Frederick H. Sheldon,9 Edward C. Holmes,10 Paul P. Gardner,11
Tammy E. Steeves,11 Peter F. Stadler,12 Sarah W. Burge,13
Eric Lyons,14 Jacqueline Smith,15 Fiona McCarthy,16
Frederique Pitel,17 Douglas Rhoads,18 David P. Froman19
1China National GeneBank, BGI-Shenzhen, Shenzhen 518083,China. 2Centre for GeoGenetics, Natural History Museum ofDenmark, University of Copenhagen, Øster Voldgade 5-7, 1350Copenhagen, Denmark. 3CIMAR/CIIMAR, Centro Interdisciplinar deInvestigação Marinha e Ambiental, Universidade do Porto, Ruados Bragas, 177, 4050-123 Porto, Portugal. 4Department ofBehavioral Neuroscience Oregon Health & Science UniversityPortland, OR 97239, USA. 5Departamento de Biologia, Faculdadede Ciências, Universidade do Porto, Rua do Campo Alegre, 4169-007 Porto, Portugal. 6Department of Biology, University of PuertoRico, Av Ponce de Leon, Rio Piedras Campus, JGD 224, San Juan,PR 009431-3360, USA. 7Trace and Environmental DNA laboratory,Department of Environment and Agriculture, Curtin University, Perth,Western Australia 6102, Australia. 8Environmental Futures ResearchInstitute, Griffith University, Nathan, Queensland 4121, Australia.9Museum of Natural Science, Louisiana State University, BatonRouge, LA 70803, USA. 10Marie Bashir Institute for InfectiousDiseases and Biosecurity, Charles Perkins Centre, School ofBiological Sciences and Sydney Medical School, The University ofSydney, Sydney NSW 2006, Australia. 11School of BiologicalSciences, University of Canterbury, Christchurch 8140, New Zealand.12Bioinformatics Group, Department of Computer Science, andInterdisciplinary Center for Bioinformatics, University of Leipzig,Hr̈telstrasse 16-18, D-04107 Leipzig, Germany. 13European MolecularBiology Laboratory, European Bioinformatics Institute, Hinxton,Cambridge CB10 1SD, UK. 14School of Plant Sciences, BIO5 Institute,University of Arizona, Tucson, AZ 85721, USA. 15Division of Geneticsand Genomics, The Roslin Institute and Royal (Dick) School ofVeterinary Studies, The Roslin Institute Building, University ofEdinburgh, Easter Bush Campus, Midlothian EH25 9RG, UK.16Department of Veterinary Science and Microbiology, University ofArizona, 1117 E Lowell Street, Post Office Box 210090-0090, Tucson,AZ 85721, USA. 17Laboratoire de Génétique Cellulaire, INRA Cheminde Borde-Rouge, Auzeville, BP 52627 , 31326 CASTANET-TOLOSANCEDEX, France. 18Department of Biological Sciences, Science andEngineering 601, University of Arkansas, Fayetteville, AR 72701, USA.19Department of Animal Sciences, Oregon State University, Corvallis,OR 97331, USA.
SUPPLEMENTARY MATERIALS
www.sciencemag.org/content/346/6215/1311/suppl/DC1Supplementary TextFigs. S1 to S42Tables S1 to S51References (92–192)
27 January 2014; accepted 6 November 201410.1126/science.1251385
RESEARCH ARTICLE
Whole-genome analyses resolveearly branches in the tree of lifeof modern birdsErich D. Jarvis,1*† Siavash Mirarab,2* Andre J. Aberer,3 Bo Li,4,5,6 Peter Houde,7
Cai Li,4,6 Simon Y. W. Ho,8 Brant C. Faircloth,9,10 Benoit Nabholz,11
Jason T. Howard,1 Alexander Suh,12 Claudia C. Weber,12 Rute R. da Fonseca,6
Jianwen Li,4 Fang Zhang,4 Hui Li,4 Long Zhou,4 Nitish Narula,7,13 Liang Liu,14
Ganesh Ganapathy,1 Bastien Boussau,15 Md. Shamsuzzoha Bayzid,2
Volodymyr Zavidovych,1 Sankar Subramanian,16 Toni Gabaldón,17,18,19
Salvador Capella-Gutiérrez,17,18 Jaime Huerta-Cepas,17,18 Bhanu Rekepalli,20
Kasper Munch,21 Mikkel Schierup,21 Bent Lindow,6 Wesley C. Warren,22
David Ray,23,24,25 Richard E. Green,26 Michael W. Bruford,27 Xiangjiang Zhan,27,28
Andrew Dixon,29 Shengbin Li,30 Ning Li,31 Yinhua Huang,31
Elizabeth P. Derryberry,32,33 Mads Frost Bertelsen,34 Frederick H. Sheldon,33
Robb T. Brumfield,33 Claudio V. Mello,35,36 Peter V. Lovell,35 Morgan Wirthlin,35
Maria Paula Cruz Schneider,36,37 Francisco Prosdocimi,36,38 José Alfredo Samaniego,6
Amhed Missael Vargas Velazquez,6 Alonzo Alfaro-Núñez,6 Paula F. Campos,6
Bent Petersen,39 Thomas Sicheritz-Ponten,39 An Pas,40 Tom Bailey,41 Paul Scofield,42
Michael Bunce,43 David M. Lambert,16 Qi Zhou,44 Polina Perelman,45,46
Amy C. Driskell,47 Beth Shapiro,26 Zijun Xiong,4 Yongli Zeng,4 Shiping Liu,4
Zhenyu Li,4 Binghang Liu,4 Kui Wu,4 Jin Xiao,4 Xiong Yinqi,4 Qiuemei Zheng,4
Yong Zhang,4 Huanming Yang,48 Jian Wang,48 Linnea Smeds,12 Frank E. Rheindt,49
Michael Braun,50 Jon Fjeldsa,51 Ludovic Orlando,6 F. Keith Barker,52
Knud Andreas Jønsson,51,53,54 Warren Johnson,55 Klaus-Peter Koepfli,56
Stephen O’Brien,57,58 David Haussler,59 Oliver A. Ryder,60 Carsten Rahbek,51,54
Eske Willerslev,6 Gary R. Graves,51,61 Travis C. Glenn,62 John McCormack,63
Dave Burt,64 Hans Ellegren,12 Per Alström,65,66 Scott V. Edwards,67
Alexandros Stamatakis,3,68 David P. Mindell,69 Joel Cracraft,70 Edward L. Braun,71
Tandy Warnow,2,72† Wang Jun,48,73,74,75,76† M. Thomas P. Gilbert,6,43† Guojie Zhang4,77†
To better determine the history of modern birds, we performed a genome-scale phylogeneticanalysis of 48 species representing all orders of Neoaves using phylogenomic methodscreated to handle genome-scale data. We recovered a highly resolved tree that confirmspreviously controversial sister or close relationships. We identified the first divergence inNeoaves, two groups we named Passerea and Columbea, representing independent lineagesof diverse and convergently evolved land and water bird species. Among Passerea, we inferthe common ancestor of core landbirds to have been an apex predator and confirm independentgains of vocal learning. Among Columbea, we identify pigeons and flamingoes as belonging tosister clades. Even with whole genomes, some of the earliest branches in Neoaves provedchallenging to resolve, which was best explained by massive protein-coding sequenceconvergence and high levels of incomplete lineage sorting that occurred during a rapidradiation after the Cretaceous-Paleogene mass extinction event about 66 million years ago.
The diversification of species is not alwaysgradual but can occur in rapid radiations,especially aftermajor environmental changes(1, 2). Paleobiological (3–7) and molecular (8)evidence suggests that such “big bang” radia-
tions occurred for neoavian birds (e.g., songbirds,parrots, pigeons, and others) and placental mam-mals, representing 95% of extant avian and mam-malian species, after the Cretaceous to Paleogene(K-Pg)mass extinction event about 66million yearsago (Ma). However, other nuclear (9–12) and mito-chondrial (13, 14) DNA studies propose an earlier,more gradual diversification, beginning withinthe Cretaceous 80 to 125 Ma. This debate is con-founded by findings that different data sets (15–19)and analytical methods (20, 21) often yield con-
trasting species trees. Resolving such timing andphylogenetic relationships is important for com-parative genomics,which can informabout humantraits and diseases (22).Recent avian studies based on fragments of 5
[~5000 base pairs (bp) (8)] and 19 [31,000 bp (17)]genes recovered some relationships inferred frommorphological data (15, 23) and DNA-DNA hy-bridization (24), postulated new relationships,and contradicted many others. Consistent withmost previous molecular and contemporary mor-phological studies (15), they divided modernbirds (Neornithes) into Palaeognathae (tinamousand flightless ratites), Galloanseres [Galliformes(landfowl) and Anseriformes (waterfowl)], andNeoaves (all other extant birds). Within Neoaves,
1320 12 DECEMBER 2014 • VOL 346 ISSUE 6215 sciencemag.org SCIENCE
A FLOCK OF GENOMES
Jarvis,$Mirarab,$et$al.,$examined$48$
bird$species$using$14,000$loci$from$
whole$genomes.$Two$trees$were$
presented.$
$
1.$A$single$dataset$maximum$
likelihood$concatena,on$analysis$
used$~300$CPU$years$and$1Tb$of$
distributed$memory,$using$TACC$and$
other$supercomputers$around$the$
world.$$
$
2.$However,$every%locus%had%a%different%%tree$–$sugges,ve$of$“incomplete$lineage$sor,ng”$–$and$
the$noisy$genomeHscale$data$required$
the$development$of$a$new$method,$
“sta,s,cal$binning”.$
$
$
$
$
Only48species,butheuris1cMLtook~300CPUyearsonmul1plesupercomputersandused1Tbofmemory
1KP:ThousandTranscriptomeProject
l 103planttranscriptomes,400-800singlecopy“genes”l Wicked,Mirarabetal.,PNAS2014l Nextphasewillbemuchbigger
l ~1000speciesand~1000genes,l andwillrequiretheinferenceofmul?plesequencealignmentsand
treesonmorethan100,000sequences
G. Ka-Shu Wong U Alberta
N. Wickett Northwestern
J. Leebens-Mack U Georgia
N. Matasci iPlant
T. Warnow, S. Mirarab, N. Nguyen UIUC UCSD UCSD
Andmanyothers
Background Results Conclusions References
Genomic data: challenges and opportunities
Muir et al. [2016]
Muir,2016
ThisTalk
• Model-basedtreees1ma1onandNP-hardproblems
• Howdivide-and-conquercanimprovetreees1ma1on
• Constrainedop1miza1on• Openproblems
MarkovModelofSiteEvolu1onSimplest(Jukes-Cantor,1969):• ThemodeltreeTisbinaryandhassubs1tu1onprobabili1esp(e)on
eachedgee.• Thestateattherootisrandomlydrawnfrom{A,C,T,G}(nucleo1des)• Ifasite(posi1on)changesonanedge,itchangeswithequalprobability
toeachoftheremainingstates.• Theevolu1onaryprocessisMarkovian.Thedifferentsitesareassumedtoevolveindependentlyandiden1callydownthetree(withratesthataredrawnfromagammadistribu1on).
Morecomplexmodels(suchastheGeneralMarkovmodel)arealsoconsidered,ojenwithlidlechangetothetheory.
Ques1ons
• Isthemodeltreeiden1fiable?• Whiches1ma1onmethodsaresta1s1callyconsistentunderthismodel?
• Howmuchdatadoesthemethodneedtoes1matethemodeltreecorrectly(withhighprobability),andhowwelldothemethodsperforminprac1ce?
Sta1s1calConsistency
error
Data
Answers?
• Weknowalotaboutwhichsiteevolu1onmodelsareiden1fiable,andwhichmethodsaresta1s1callyconsistent.
• Weknowalidlebitaboutthesequencelengthrequirementsforstandardmethods.
• Extensivestudiesshowthateventhebestmethodsproducegenetreeswithsomeerror.
1 Hill-climbing heuristics for NP-hard optimization criteria (e.g., Maximum Likelihood)
Local optimum
Cost
Global optimum
Phylogenetic trees 2 Polynomial time distance-based methods: Neighbor
Joining, FastME, etc. 3. Bayesian methods
Statistically consistent phylogenetic reconstruction methods
Distance-basedes1ma1on
Quan1fyingError
FN: false negative (missing edge) FP: false positive (incorrect edge) 50% error rate
FN
FP
NeighborJoining(NJ)onlargetreesSimula1onstudybaseduponfixededgelengths,K2Pmodelofevolu1on,sequencelengthsfixedto1000nucleo1des.
Errorratesreflectpropor1onofincorrectedgesininferredtrees.
[Nakhlehetal.ISMB2001]
NJ
0 400 800 1600 1200 No. Taxa
0
0.2
0.4
0.6
0.8
Erro
r Rat
e
Inotherwords…
error
Data
Sta1s1calconsistencydoesn’tguaranteeaccuracyw.h.p.unlessthesequencesarelongenough.
“Boos1ng”phylogenyreconstruc1onmethods
• DCMs“boost”theperformanceofphylogenyreconstruc1onmethods.
DCMBasemethodM DCM-M
Divide-and-conquerforphylogenyes1ma1on
MarkovModelofSiteEvolu1onSimplest(Jukes-Cantor,1969):• ThemodeltreeTisbinaryandhassubs1tu1onprobabili1esp(e)on
eachedgee.• Thestateattherootisrandomlydrawnfrom{A,C,T,G}(nucleo1des)• Ifasite(posi1on)changesonanedge,itchangeswithequalprobability
toeachoftheremainingstates.• Theevolu1onaryprocessisMarkovian.Thedifferentsitesareassumedtoevolveindependentlyandiden1callydownthetree(withratesthataredrawnfromagammadistribu1on).
Morecomplexmodels(suchastheGeneralMarkovmodel)arealsoconsidered,ojenwithlidlechangetothetheory.
Sequencelengthrequirements
Thesequencelength(numberofsites)thataphylogenyreconstruc1onmethodMneedstoreconstructthetruetreewithprobabilityatleast1-εdependson
• M(themethod)• ε• f=minp(e),• g=maxp(e),and• n,thenumberofleaves
Wefixeverythingbutn.
Statistical consistency, exponential convergence, and absolute fast convergence (afc)
Distance-basedes1ma1on
NeighborJoining’ssequencelengthrequirementisexponen1al!
• Adeson1999:LetTbeaJukes-Cantormodeltreedefiningaddi1vematrixD.ThenNeighborJoiningwillreconstructthetruetreewithhighprobabilityfromsequencesthatareoflengthO(lgnemaxDij).
• LaceyandChang2009:Matchinglowerbound
Divide-and-conquerforphylogenyes1ma1on
RefinementStep
SupertreeStep
Constructsubsettrees
DCM1Decomposi1ons
DCM1 decomposition : Compute maximal cliques
Input: Set S of sequences, distance matrix d, threshold value
1. Compute threshold graph
}),(:),{(,),,( qjidjiESVEVGq ≤===2. Perform minimum weight triangulation (note: if d is an additive matrix, then the threshold graph is provably triangulated).
}{ ijdq∈
DCM1-boos1ng:Warnow,St.John,andMoret,
SODA2001
• TheDCM1phaseproducesacollec1onoftrees(oneforeachthreshold),andtheSQSphasepicksthe“best”tree.
• Foragiventhreshold,thebasemethodisusedtoconstructtreesonsmallsubsets(definedbythethreshold)ofthetaxa.Thesesmalltreesarethencombinedintoatreeonthefullsetoftaxa.
DCM1 SQS
Exponentially converging (base) method
Absolute fast converging (DCM1-boosted) method
NeighborJoiningonlargediametertreesSimula1onstudybaseduponfixededgelengths,K2Pmodelofevolu1on,sequencelengthsfixedto1000nucleo1des.
Errorratesreflectpropor1onofincorrectedgesininferredtrees.
[Nakhlehetal.ISMB2001]
NJ
0 400 800 1600 1200 No. Taxa
0
0.2
0.4
0.6
0.8
Erro
r Rat
e
DCM1-boos1ngdistance-basedmethods[Nakhlehetal.ISMB2001]
• Theorem(Warnowetal.,SODA2001):DCM1-NJconvergestothetruetreefrompolynomiallengthsequences
NJ DCM1-NJ
0 400 800 1600 1200 No. Taxa
0
0.2
0.4
0.6
0.8
Erro
r Rat
e
DCM1-boos1ng
• DCM1-boos1ng:reducingsequencelengthrequirementsforgenetreeaccuracyfromexponen1altopolynomial
• Keyalgorithmicingredients:– Constructtriangulatedgraph– Applytreeconstruc1onmethodstosubsets– Combinesubsettreestogetherusingsupertreemethod
– Selectbesttreefromasetoftrees
DCM1-boos1ng
• DCM1-boos1ng:reducingsequencelengthrequirementsforgenetreeaccuracyfromexponen1altopolynomial
• Keyalgorithmicingredients:– Constructtriangulatedgraph– Applytreeconstruc1onmethodstosubsets– Combinesubsettreestogetherusingsupertreemethod
– Selectbesttreefromasetoftrees
Otherapplica1onsofdivide-and-conquer
• DACTAL-boos1ng:– almostalignment-freetreees1ma1on– Improvingmul1-locusspeciestreees1ma1on
• DCM2-boos1ng:improvingheuris1csearchesformaximumlikelihood
• Superfine-boos1ng:improvingsupertreemethods
Divide-and-conquerforphylogenyes1ma1on
RefinementStep
SupertreeStep
Constructsubsettrees
SupertreeProblems
• QuartetMedianSupertree• Robinson-FouldsSupertree• MatrixRepresenta1onwithParsimony• MatrixRepresenta1onwithLikelihood• Etc.AllareNP-hardbecauseeventes1ngcompa1bilityofunrootedtreesisNP-complete.
SupertreeProblems
• QuartetMedianSupertree• Robinson-FouldsSupertree• MatrixRepresenta1onwithParsimony• MatrixRepresenta1onwithLikelihood• Etc.AllareNP-hardbecauseeventes1ngcompa1bilityofunrootedtreesisNP-complete.
phylogenomics
2
gene 999gene 2ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG
CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G
AGCAGCATCGTG AGCAGC-TCGTG AGCAGC-TC-TG C-TA-CACGGTG
CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT
gene 1000gene 1
“gene” here refers to a portion of the genome (not a functional gene)
Orangutan
Gorilla
Chimpanzee
Human
I’ll use the term “gene” to refer to “c-genes”: recombination-free orthologous stretches of the genome
Gene tree discordance
3
Orang.Gorilla ChimpHuman Orang.Gorilla Chimp Human
gene1000gene 1
IncompleteLineageSor1ng(ILS)isadominantcauseofgenetreeheterogeneity
Genetreesinsidethespeciestree(CoalescentProcess)
Present
Past
CourtesyJamesDegnan
GorillaandOrangutanarenotsiblingsinthespeciestree,buttheyareinthegenetree.
IncompleteLineageSor1ng(ILS)
• Confoundsphylogene1canalysisformanygroups:Hominids,Birds,Yeast,Animals,Toads,Fish,Fungi,etc.
• Thereissubstan1aldebateabouthowtoanalyzephylogenomicdatasetsinthepresenceofILS,focusedaroundsta1s1calconsistencyguarantees(theory)andperformanceondata.
Anomalyzone• Ananomalousgenetree(AGT)isonethatismoreprobablethanthetruespeciestreeunderthemul1-speciescoalescentmodel.
• Theorem(Hudson1983):Therearenorooted3-leafAGTs.
• Theorem(Allmanetal.2011,Degnan2013):Therearenounrooted4-leafAGTs.
• Theorem(Degnan2013,Rosenberg2013):Forn>3,therearemodelspeciestreeswithrootedAGTs,andforn>4therearemodelspeciestreeswithunrootedAGTs.
Anomalyzone• Ananomalousgenetree(AGT)isonethatismoreprobablethanthetruespeciestreeunderthemul1-speciescoalescentmodel.
• Theorem(Hudson1983):Therearenorooted3-leafAGTs.
• Theorem(Allmanetal.2011,Degnan2013):Therearenounrooted4-leafAGTs.
• Theorem(Degnan2013,Rosenberg2013):Forn>3,therearemodelspeciestreeswithrootedAGTs,andforn>4therearemodelspeciestreeswithunrootedAGTs.
Anomalyzone• Ananomalousgenetree(AGT)isonethatismoreprobablethanthetruespeciestreeunderthemul1-speciescoalescentmodel.
• Theorem(Hudson1983):Therearenorooted3-leafAGTs.
• Theorem(Allmanetal.2011,Degnan2013):Therearenounrooted4-leafAGTs.
• Theorem(Degnan2013,Rosenberg2013):Forn>3,therearemodelspeciestreeswithrootedAGTs,andforn>4therearemodelspeciestreeswithunrootedAGTs.
Anomalyzone• Ananomalousgenetree(AGT)isonethatismoreprobablethanthetruespeciestreeunderthemul1-speciescoalescentmodel.
• Theorem(Hudson1983):Therearenorooted3-leafAGTs.
• Theorem(Allmanetal.2011,Degnan2013):Therearenounrooted4-leafAGTs.
• Theorem(Degnan2013,Rosenberg2013):Forn>3,therearemodelspeciestreeswithrootedAGTs,andforn>4therearemodelspeciestreeswithunrootedAGTs.
Anomalyzone• Ananomalousgenetree(AGT)isonethatismoreprobablethanthetruespeciestreeunderthemul1-speciescoalescentmodel.
• Theorem(Hudson1983):Therearenorooted3-leafAGTs.
• Theorem(Allmanetal.2011,Degnan2013):Therearenounrooted4-leafAGTs.
• Theorem(Degnan2013,Rosenberg2013):Forn>3,therearemodelspeciestreeswithrootedAGTs,andforn>4therearemodelspeciestreeswithunrootedAGTs.
ASTRAL [Mirarab, et al., ECCB/Bioinformatics, 2014]
• Optimization Problem (NP-Hard):
• Theorem: Statistically consistent under the multi-species coalescent model when solved exactly
15
Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees
Set of quartet trees induced by T
a gene tree
Score(T ) =X
t2TQ(T ) \Q(t)
all input gene trees
ConstrainedMaximumQuartetSupportTree
• Input:SetT= {t1,t2,…,tk}ofunrootedgenetrees,witheachtreeonsetSwithnspecies,andsetXofallowedbipar11ons
• Output:UnrootedtreeTonleafsetS,maximizingthetotalquartettreesimilaritytoT, subjecttoTdrawingitsbipar11onsfromX.
Theorems:• Mirarabetal.2014:IfXcontainsthebipar11onsfromtheinput
genetrees(andperhapsothers),thenanexactsolu1ontothisproblemissta1s1callyconsistentundertheMSC.
• MirarabandWarnow2015:TheconstrainedMQSTproblemcanbesolvedinO(|X|2nk)1me.(Weusedynamicprogramming,andbuildtheunrootedtreefromthebodom-up,basedon“allowedclades”–halvesoftheallowedbipar11ons.)
ConstrainedMaximumQuartetSupportTree
• Input:SetT= {t1,t2,…,tk}ofunrootedgenetrees,witheachtreeonsetSwithnspecies,andsetXofallowedbipar11ons
• Output:UnrootedtreeTonleafsetS,maximizingthetotalquartettreesimilaritytoT, subjecttoTdrawingitsbipar11onsfromX.
Theorems:• Mirarabetal.2014:IfXcontainsthebipar11onsfromtheinput
genetrees(andperhapsothers),thenanexactsolu1ontothisproblemissta1s1callyconsistentundertheMSC.
• MirarabandWarnow2015:TheconstrainedMQSTproblemcanbesolvedinO(|X|2nk)1me.(Weusedynamicprogramming,andbuildtheunrootedtreefromthebodom-up,basedon“allowedclades”–halvesoftheallowedbipar11ons.)
ConstrainedMaximumQuartetSupportTree
• Input:SetT= {t1,t2,…,tk}ofunrootedgenetrees,witheachtreeonsetSwithnspecies,andsetXofallowedbipar11ons
• Output:UnrootedtreeTonleafsetS,maximizingthetotalquartettreesimilaritytoT, subjecttoTdrawingitsbipar11onsfromX.
Theorems:• Mirarabetal.2014:IfXcontainsthebipar11onsfromtheinput
genetrees(andperhapsothers),thenanexactsolu1ontothisproblemissta1s1callyconsistentundertheMSC.
• MirarabandWarnow2015:TheconstrainedMQSTproblemcanbesolvedinO(|X|2nk)1me.(Weusedynamicprogramming,andbuildtheunrootedtreefromthebodom-up,basedon“allowedclades”–halvesoftheallowedbipar11ons.)
Simulation study• Variable parameters:
• Number of species: 10 – 1000
• Number of genes: 50 – 1000
• Amount of ILS: low, medium, high
• Deep versus recent speciation
• 11 model conditions (50 replicas each) with heterogenous gene tree error
• Compare to NJst, MP-EST, concatenation (CA-ML)
• Evaluate accuracy using FN rate: the percentage of branches in the true tree that are missing from the estimated tree
14
Truegenetrees Sequencedata
Es�matedspeciestree
Finch Falcon Owl Eagle Pigeon
Es�matedgenetreesFinch Owl Falcon Eagle Pigeon
True(model)speciestree
ASTRAL-II
look at all pairs of leaves chosen each from one of the children ofu. For each such pair of leaves, there are
�u0
2
�quartet trees that put
that pair together, where u0 is the number of leaves outside the nodeu. This will examine each pair of nodes in each of the input k nodesexactly once and would therefore require O(n2k) computations.The final score can be normalized by the maximum number of inputquartet trees that include a pair of taxa.
Given the similarity matrix, we calculate an UPGMA tree andadd all its bipartitions to the set X. This heuristic adds relatively fewbipartitions, but the matrix is used in the next heuristic, which is ourmain addition mechanism.
Greedy: We estimate the greedy consensus of the gene trees atdifferent threshold levels (0, 1/100, 2/100, 5/100, 1/10, 1/4, 1/3).For each polytomy in each greedy consensus tree, we resolve thepolytomy in multiple ways and add bipartitions implied by thoseresolutions to the set X. First, we resolve the polytomy by applyingthe UPGMA algorithm to the similarity matrix, starting from theclades given by the polytomy. Then, we sample one taxon fromeach side of the ploytomy randomly, and use the greedy consensusof the gene trees restricted to this subsample to find a resolutionof the polytomy (we randomly resolve any multifunctions in thisgreedy consensus on indued subsample). We repeat this process atleast 10 times, but if the subsampled greedy consensus trees includesufficiently frequent bipartitions (defined as > 1%), we do morerounds of random sampling (we increase the number of iterationsby two every time this happens). For each random subsamplearound a polytomy, we also resolve it by calculating an UPGMAtree on the subsampled similarity matrix. Finally, for the two firstgreedy threshold values and the first 10 random subsamples, wealso use a third strategy that can potentially add a larger number ofbipartitions. For each subsampled taxon x, we resolve the polytomyas a caterpillar tree by sorting the remaining taxa according to theirsimilarity with x.
Gene tree polytomies: When gene trees include polytomies, wealso add new bipartitions to set X. We first compute the greedyconsensus of the input gene trees with threshold 0 and if thegreedy consensus has polytomies, we resolve them using UPGMA;we repeat this process twice to account for uncertainty in greedyconsensus estimation. Then, for each gene tree polytomy, we use thetwo resolved consensus trees to infer a resolution of the polytomyand we add the implied resolutions to set X.
3.3 Multi-furcating input gene trees
Extending ASTRAL to inputs that include polytomies requiressolving the weighted quartet tree problem when each node of theinput defines not a tripartition, but a multi-partition of the setof taxa. We start by a basic observation: every resolved quartettree induced by a gene tree maps to two nodes in the gene treeregardless of whether the gene tree is binary or not. In other words,induced quartet trees that map to only one node of the gene tree areunresolved. When maximizing the quartet support, these unresolvedgene tree quartet trees are inconsequential and need to be ignored.Now, consider a polytomy of degree d. There are
�d3
�ways to select
three sides of the polytomy. Each of these ways of selecting threesides defines a tripartition of a subset of taxa. Any selection of twotaxa from one side of this tripartition and one taxon from each of theremaining two sides still defines an induced resolved quartet tree,
0
5
10
15
20
0% 20% 40% 60% 80%RF distance (true species tree vs true gene trees)
dens
ity
rate1e−06 1e−07
tree height10M 2M 500K
(a) True gene tree discordance
0
1
2
3
4
0% 25% 50% 75% 100%RF distance (true vs estimated)
dens
ity
(b) Gene tree estimation error
Fig. 1. Characteristics of the simulation (a) RF distance between the truespecies tree and the true gene trees (50 replicates of 1000 genes) for DatasetI. Tree height directly affects the amount of true discordance; the speciationrate affects true gene tree discordance only with 10M tree length. (b) RFdistance between true gene trees and estimated gene trees for Dataset I. Seealso Figure S1 for inter and intra-replicate gene tree error distributions.
and each induced resolved quartet tree would still map to exactlytwo nodes in our multi-furcating tree. Thus, all the algorithmicassumptions of ASTRAL remain intact, as long as for each multi-furcating node in an input gene tree, we treat it as a collection of
�d3
�
tripartitions. Note that in the presence of polytomies, the runningtime analysis can change because analyzing each multi-furcatingnode requires time cubic in its degree and the degree can increase inprinciple with n. Thus, the running time depends on the patterns ofthe multi-furcations and cannot be studied in a general case.
Statistical Consistency: ASTRAL-I was statistically consistent, andchanges from ASTRAL-I to ASTRAL-II either affect running time,or enlarge the search space, which does not negate consistency.
Theorem 3: ASTRAL-II is statistically consistent for binarycomplete input gene trees.
4 EXPERIMENTAL SETUPSimulation Procedure: We used SimPhy, a tool developed by Malloet al. (2015), to simulate species trees and gene trees (producedin mutation units), and then used Indelible to simulate sequencesdown the gene trees with varying length and model parameters. Weestimated gene trees on these simulated gene alignments, which wethen used in coalescent-based analyses.
We simulated 10 model conditions, which we divide into twodatasets, with one model condition appearing in both datasets. We
3
UsedSimPhy,MalloandPosada,2015
16
4%
8%
12%
16%
10 50 100 200 500 1000number of species
Spec
ies
tree
topo
logi
cal e
rror (
FN)
ASTRAL−IIMP−EST
1000 genes, “medium” levels of recent ILS
Tree accuracy when varying the number of species
16
4%
8%
12%
16%
10 50 100 200 500 1000number of species
Spec
ies
tree
topo
logi
cal e
rror (
FN)
ASTRAL−IIMP−EST
4%
8%
12%
16%
10 50 100 200 500 1000number of species
Spec
ies
tree
topo
logi
cal e
rror (
FN)
ASTRAL−IIMP−EST
1000 genes, “medium” levels of recent ILS
Tree accuracy when varying the number of species
Otherpolynomial1mealgorithmsforconstrainedop1miza1onproblems
• MinimizeDuplica1on/LossSupertree(HalledandLagergren2000)
• QuartetSupport(BryantandSteel2001)• ConstrainedMinimizeDeepCoalescence(ThanandNakhleh2000,Yu,Warnow,andNakhleh2011).
• Genetreees1ma1onunderDuplica1onandLoss(Szöllősietal.2013)
• ConstrainedRobinson-FouldsSupertree(Vachaspa1andWarnow2016)
Summary
• NP-hardop1miza1onproblemsaboundinphylogenyreconstruc1on,andincomputa1onalbiologyingeneral,andneedveryaccuratesolu1ons.
• Divide-and-conquertechniquescanprovidegreatlyimprovedaccuracyandscalability,andexcellentsta1s1calperformanceguarantees.Butdivide-and-conquermethodsdependonhavinggoodsupertreemethods.
• Constrainedexactop1miza1onhasbeensurprisinglybeneficialforphylogenomicanalysis.
Summary
• NP-hardop1miza1onproblemsaboundinphylogenyreconstruc1on,andincomputa1onalbiologyingeneral,andneedveryaccuratesolu1ons.
• Divide-and-conquertechniquescanprovidegreatlyimprovedaccuracyandscalability,andexcellentsta1s1calperformanceguarantees.Butdivide-and-conquermethodsdependonhavinggoodsupertreemethods.
• Constrainedexactop1miza1onhasbeensurprisinglybeneficialforphylogenomicanalysis.
Summary
• NP-hardop1miza1onproblemsaboundinphylogenyreconstruc1on,andincomputa1onalbiologyingeneral,andneedveryaccuratesolu1ons.
• Divide-and-conquertechniquescanprovidegreatlyimprovedaccuracyandscalability,andexcellentsta1s1calperformanceguarantees.Butdivide-and-conquermethodsdependonhavinggoodsupertreemethods.
• Constrainedexactop1miza1onhasbeensurprisinglybeneficialforphylogenomicanalysis.
OpenProblems• Whatotherop1miza1onproblemsaretractable,given
constraintsets?• HowshouldwedefinethesetXofconstraintsfromasetof
sourcetrees,especiallywhenthesourcetreesareincomplete?
• Arethereotherwaysofconstrainingthesearchspacethatareeffec1veanduseful?
• Whatareothereffec1vewaysofapproachingtrulylarge-scalephylogenyes1ma1on?
• Candivide-and-conquerbeemployedwithBayesianmethods?(Or,moregenerally,howcanwemakeBayesianmethodsmorescalable?)
Acknowledgments
NSFgrantDBI-1461364(jointwithNoahRosenbergatStanfordandLuayNakhlehatRice): hdp://tandy.cs.illinois.edu/PhylogenomicsProject.htmlPapersavailableathdp://tandy.cs.illinois.edu/papers.htmlASTRAL:Availableathdps://github.com/smirarabFastRFS:Availableathdps://github.com/pranjalv123/FastRFS
ConstrainedRobinson-FouldsSupertree
• Input:setT ofsourcetrees,andsetXofallowedbipar11ons
• Output:treeTthatminimizesthetotalRobinson-Fouldsdistancetotheinputsourcetrees,andthatdrawsitsbipar11onsfromX.
Theorem(Vachaspa1andWarnow2016):TheonstrainedRFSupertreecanbesolvedinO(|X|2nk)1me,wherethereareksourcetreesandnspecies.
Avian Phylogenomics Project Erich Jarvis, HHMI
Guojie Zhang, BGI
• Approx. 50 species, whole genomes • 14,000 loci • Multi-national team (100+ investigators) • 8 papers published in special issue of Science 2014
Biggest computational challenges: 1. Multi-million site maximum likelihood analysis (~300 CPU years, and 1Tb of distributed memory, at supercomputers around world) 2. Constructing “coalescent-based” species tree from 14,000 different gene trees
MTP Gilbert, Copenhagen
Siavash Mirarab, Tandy Warnow, Texas Texas and UIUC
…ACGGTGCAGTTACC-A…
…AC----CAGTCACCTA…
Thetruemul1plealignment– Reflects historical substitution, insertion, and deletion
events – Defined using transitive closure of pairwise alignments
computed on edges of the true tree
…ACGGTGCAGTTACCA…
Substitution Deletion
…ACCAGTCACCTA…
Insertion
1KP:ThousandTranscriptomeProject
l 103planttranscriptomes,400-800singlecopy“genes”l Wicked,Mirarabetal.,PNAS2014l Nextphasewillbemuchbigger
l ~1000speciesand~1000genes,l andwillrequiretheinferenceofmul?plesequencealignmentsand
treesonmorethan100,000sequences
G. Ka-Shu Wong U Alberta
N. Wickett Northwestern
J. Leebens-Mack U Georgia
N. Matasci iPlant
T. Warnow, S. Mirarab, N. Nguyen UIUC UCSD UCSD
Andmanyothers
Constrainedop1miza1on
• FirstproposedinHalletandLagergren2000,fortheduplica1on-lossspeciestreeproblem
• Algorithmsforconstrainedop1miza1onusedynamicprogrammingtofindop1malsolu1ons,construc1ngthebesttreefromthe“bodom-up”
• ThesetXisusuallydefinedtobethebipar11onsfromthesourcetrees.
DACTAL
New supertree method: SuperFine
Existing Method: RAxML(MAFFT)
pRecDCM3
BLAST-based
Overlapping subsets
A tree for each subset
Unaligned Sequences
A tree for the entire dataset
Resultsonthreebiologicaldatasets–6000to28,000sequences.Weshowresultswith5DACTALiteraKons
DACTAL
Construct supertree
Construct subset trees
Decompose using tree
Overlapping subsets
A tree for each subset
Arbitrary input
A tree for the entire dataset
TriangulatedGraphs
• Defini1on:Agraphistriangulatedifithasnosimplecyclesofsizefourormore.
Orangutan Gorilla Chimpanzee Human
From the Tree of the Life Website, University of Arizona
Samplingmul1plegenesfrommul1plespecies
Standardheuris1csearch
T
T’
Hill-climbingRandomperturba1on
Problems with current techniques for MP
0
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0.16
0.18
0.2
0 4 8 12 16 20 24
Hours
Average MP score above
optimal, shown as a percentage of
the optimal
Shown here is the performance of the TNT software for maximum parsimony on a real dataset of almost 14,000 sequences. The required level of accuracy with respect to MP score is no more than 0.01% error (otherwise high topological error results). (“Optimal” here means best score to date, using any method for any amount of time.)
Performance of TNT with time
SolvingNP-hardproblemsexactlyis…unlikely
• Numberof(unrooted)binarytreesonnleavesis(2n-5)!!
• Ifeachtreeon1000taxacouldbeanalyzedin0.001seconds,wewouldfindthebesttreein
2890millennia
#leaves #trees4 35 156 1057 9458 103959 13513510 202702520 2.2 x 1020
100 4.5 x 10190
1000 2.7 x 102900
Gene tree discordance
3
Orang.Gorilla ChimpHuman Orang.Gorilla Chimp Human
gene1000gene 1
AvianPhylogenomicsProjectEJarvis,HHMI
GZhang,BGI
• Approx.50species,wholegenomes,14,000loci
MTPGilbert,Copenhagen
S.MirarabMd.S.Bayzid,UT-Aus1nUT-Aus1n
T.WarnowUT-Aus1n
Plusmanymanyotherpeople…
Challenges:• Concatena1onanalysisused>200CPUyears
Butalso• Massivegenetreeconflictsugges1veofILS,andmostgenetreeshadverylowbootstrapsupport• Coalescent-basedanalysisusingMP-ESTproducedtreethatconflictedwithconcatena1onanalysis
Bigdatasetsandhardproblems
• Computa1onallyintensiveproblems:– Mul1plesequencealignment– Maximumlikelihoodgenetreees1ma1on– Speciestreees1ma1onfrommul1plegenetrees
• Fastmethodsaregenerallynotsufficientlyaccurate
• Accuratemethodsaregenerallycomputa1onallyintensive
Quantifying Error
FN: false negative (missing edge)
FP: false positive (incorrect edge)
FP
50% error rate
FN
Neighbor joining has poor performance on large diameter trees [Nakhleh et al. ISMB 2001]
Theorem (Atteson): Exponential sequence length requirement for Neighbor Joining!
NJ
0 400 800 No. Taxa
1600 1200
0.4 0.2
0
0.6
0.8
Erro
r Rat
e
Orangutan Gorilla Chimpanzee Human
From the Tree of the Life Website, University of Arizona
Species Tree Estimation requires multiple genes!
Orangutan Gorilla Chimpanzee Human
From the Tree of the Life Website, University of Arizona
Speciestreees1ma1on:difficult,evenforsmalldatasets!
Major Challenges: large datasets, fragmentary sequences
• Multiple sequence alignment: Few methods can run on large datasets, and alignment accuracy is generally poor for large datasets with high rates of evolution.
• Gene Tree Estimation: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements).
• Species Tree Estimation: gene tree incongruence makes accurate estimation of species tree challenging.
• Phylogenetic Network Estimation: Horizontal gene transfer and hybridization requires non-tree models of evolution
Both phylogenetic estimation and multiple sequence alignment are also
impacted by fragmentary data.
Major Challenges: large datasets, fragmentary sequences
• Multiple sequence alignment: Few methods can run on large datasets, and alignment accuracy is generally poor for large datasets with high rates of evolution.
• Gene Tree Estimation: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements).
• Species Tree Estimation: gene tree incongruence makes accurate estimation of species tree challenging.
• Phylogenetic Network Estimation: Horizontal gene transfer and hybridization requires non-tree models of evolution
Both phylogenetic estimation and multiple sequence alignment are also
impacted by fragmentary data.
Strict Consensus Merger (SCM)a b
c d
e
f g
a b
c d h
i j
e
f g
h i j
a b
c
d
a b
c
d
e
f g
a b
c
d h
i j
Large-scale statistical phylogeny estimation Ultra-large multiple-sequence alignment Estimating species trees from incongruent gene trees Supertree estimation Genome rearrangement phylogeny Reticulate evolution
Visualization of large trees and alignments Data mining techniques to explore multiple optima
The Tree of Life: Multiple Challenges
Largedatasets:100,000+sequences10,000+genes“BigData”complexity
Scien1ficchallenges:• Ultra-largemul1ple-sequencealignment• Alignment-freephylogenyes1ma1on• Supertreees1ma1on• Es1ma1ngspeciestreesfrommanygenetrees• Genomerearrangementphylogeny• Re1culateevolu1on• Visualiza1onoflargetreesandalignments• Dataminingtechniquestoexploremul1pleop1ma• Theore1calguaranteesunderMarkovmodelsofevolu1on
Applica1ons:• metagenomics• proteinstructureandfunc1onpredic1on• traitevolu1on• detec1onofco-evolu1on• systemsbiology
TheTreeofLife:Mul?pleChallenges
Techniques:• Graphtheory(especiallychordalgraphs)• Probabilitytheoryandsta1s1cs• HiddenMarkovmodels• Combinatorialop1miza1on• Heuris1cs• Supercompu1ng
1kp:ThousandTranscriptomeProject
l PlantTreeofLifebasedontranscriptomesof~1200speciesl Morethan13,000genefamilies(mostnotsinglecopy)l Firstpaper:PNAS2014(~100speciesand~800loci)• GeneTreeIncongruence
G. Ka-Shu Wong U Alberta
N. Wickett Northwestern
J. Leebens-Mack U Georgia
N. Matasci iPlant
T. Warnow, S. Mirarab, N. Nguyen, UIUC UT-Austin UT-Austin
Plus many many other people…
UpcomingChallenges(~1200species,~400loci):• Speciestreees1ma1onunderthemul1-speciescoalescentfromhundredsofconflic1nggenetreeson>1000species(wewilluseASTRAL–Mirarabetal.2014,Mirarab&Warnow 2015)• Mul1plesequencealignmentof>100,000sequences(withlotsoffragments!)–wewilluseUPP(Nguyenetal.,2015)
phylogenomics
2
gene 999gene 2ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG
CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G
AGCAGCATCGTG AGCAGC-TCGTG AGCAGC-TC-TG C-TA-CACGGTG
CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT
gene 1000gene 1
“gene” here refers to a portion of the genome (not a functional gene)
Orangutan
Gorilla
Chimpanzee
Human
I’ll use the term “gene” to refer to “c-genes”: recombination-free orthologous stretches of the genome