cs 581 / bioe 540: algorithmic computaonal genomicstandy.cs.illinois.edu/581-intro-to-course.pdf ·...
TRANSCRIPT
CS581/BIOE540:AlgorithmicComputa<onalGenomics
TandyWarnowDepartmentsofBioengineeringandComputerSciencehHp://tandy.cs.illinois.edu
CourseDetails
Officehours:Tuesdays12:30-1:30(Siebel3235)Coursewebpage: hHp://tandy.cs.illinois.edu/581-2017.htmlTextbook:Computa<onalPhylogene<cs,availablefordownloadathHp://tandy.cs.illinois.edu/textbook.pdfTA:PranjalVachaspa<(tobeconfirmed)
Today
• Describesomeimportantproblemsincomputa<onalbiology,forwhichstudentsinthiscoursecoulddevelopimprovedmethods
• Explainhowthecoursewillberun• Answerques<ons
ThisCourseTopics:computa<onalandsta<s<calproblemsinsequenceanalysis(e.g.,mul<plesequencealignment,phylogenyes<ma<on,metagenomics,etc.).Focus:understandingthemathema<calfounda<ons,anddesigningalgorithmswithoutstandingaccuracyandspeedonlarge,complexdatasets.Thisisnotacourseabouthowtousethetools.
Prerequisites
Nobackgroundinbiologyisneeded.However,thecoursehasthefollowingprerequisites:
• CS374:computa<onalcomplexity,algorithmdesigntechniques,andprovingtheoremsaboutalgorithms
• CS361:probabilityandsta<s<cs
• Byrecursion,CS225:programming
Ifyouhaven’tsa<sfiedthepre-reqs:
Youneedpermissiontostayinthecourse.
• Thefirsthomeworkisdue(byemail)onSaturdayat1PM.SeethehomeworkwebpagehHp://tandy.cs.illinois.edu/cs581-2017-hw.html
• Thenmakeanappointmenttoseemetoreviewthehomework.
Thiscourse
• Phylogenyes<ma<onbasedonstochas<cmodelsofsequenceevolu<onandgenomeevolu<on
• Mul<plesequencealignment• Applica<onstometagenomics,protein
structurepredic<on,andotherbiologicalproblems
Orangutan Gorilla Chimpanzee Human
From the Tree of the Life Website, University of Arizona
Species Tree
Evolu<oninformsabouteverythinginbiology
• Biggenomesequencingprojectsjustproducedata------sowhat?
• Evolu<onaryhistoryrelatesallorganismsandgenes,andhelpsusunderstandandpredict– interac<onsbetweengenes(gene<cnetworks)– drugdesign– predic<ngfunc<onsofgenes– influenzavaccinedevelopment– originsandspreadofdisease– originsandmigra<onsofhumans
Constructing the Tree of Life: Hard Computational Problems
NP-hardproblemsLargedatasets
100,000+sequencesthousandsofgenes
“Bigdata”complexity:
modelmisspecifica<onfragmentarysequenceserrorsininputdatastreamingdata
Phylogenomicpipeline
SelecttaxonsetandmarkersGatherandscreensequencedata,possiblyiden<fyorthologsComputemul<plesequencealignmentsforeachlocusComputespeciestreeornetwork:
Computegenetreesonthealignmentsandcombinethees<matedgenetrees,OREs<mateatreefromaconcatena<onofthemul<plesequencealignments
Getsta<s<calsupportoneachbranch(e.g.,bootstrapping)Es<matedatesonthenodesofthephylogenyUsespeciestreewithbranchsupportanddatestounderstandbiology
Phylogenomicpipeline
SelecttaxonsetandmarkersGatherandscreensequencedata,possiblyiden<fyorthologsComputemul<plesequencealignmentsforeachlocusComputespeciestreeornetwork:
Computegenetreesonthealignmentsandcombinethees<matedgenetrees,OREs<mateatreefromaconcatena<onofthemul<plesequencealignments
Getsta<s<calsupportoneachbranch(e.g.,bootstrapping)Es<matedatesonthenodesofthephylogenyUsespeciestreewithbranchsupportanddatestounderstandbiology
Research Strategies
Improvedalgorithmsthrough:• Divide-and-conquer• “Bin-and-conquer”• Iteration• Bayesiansta<s<cs• HiddenMarkovModels• Graphtheory• Combinatorialop<miza<on
Sta<s<calmodellingMassiveSimula<onsHighPerformanceCompu<ng
AvianPhylogenomicsProject
GZhang,BGI
• Approx.50species,wholegenomes• 14,000loci
MTPGilbert,Copenhagen
S.MirarabMd.S.Bayzid,UT-Aus<nUT-Aus<n
T.WarnowUT-Aus<n
Plusmanymanyotherpeople…
ErichJarvis,HHMI
Science,December2014(Jarvis,Mirarab,etal.,andMirarabetal.)
1kp:ThousandTranscriptomeProject
l PlantTreeofLifebasedontranscriptomesof~1200speciesl Morethan13,000genefamilies(mostnotsinglecopy)l Firstpaper:PNAS2014(~100speciesand~800loci)GeneTreeIncongruence
G. Ka-Shu Wong U Alberta
N. Wickett Northwestern
J. Leebens-Mack U Georgia
N. Matasci iPlant
T. Warnow, S. Mirarab, N. Nguyen, UIUC UT-Austin UT-Austin
Plus many many other people…
UpcomingChallenges(~1200species,~400loci)
DNA Sequence Evolution
AAGACTT
TGGACTT AAGGCCT
-3 mil yrs
-2 mil yrs
-1 mil yrs
today
AGGGCAT TAGCCCT AGCACTT
AAGGCCT TGGACTT
TAGCCCA TAGACTT AGCGCTT AGCACAA AGGGCAT
AGGGCAT TAGCCCT AGCACTT
AAGACTT
TGGACTT AAGGCCT
AGGGCAT TAGCCCT AGCACTT
AAGGCCT TGGACTT
AGCGCTT AGCACAA TAGACTT TAGCCCA AGGGCAT
Phylogeny Problem
TAGCCCA TAGACTT TGCACAA TGCGCTT AGGGCAT
U V W X Y
U
V W
X
Y
Performance criteria • Running time • Space • Statistical performance issues (e.g., statistical
consistency) with respect to a Markov model of evolution
• “Topological accuracy” with respect to the underlying true tree or true alignment, typically studied in simulation
• Accuracy with respect to a particular criterion (e.g. maximum likelihood score), on real data
Quantifying Error
FN: false negative (missing edge) FP: false positive (incorrect edge) 50% error rate
FN
FP
Statistical consistency, exponential convergence, and absolute fast convergence (afc)
1 Hill-climbing heuristics for hard optimization criteria (Maximum Parsimony and Maximum Likelihood)
Local optimum
Cost
Global optimum
Phylogenetic trees 2 Polynomial time distance-based methods: Neighbor
Joining, FastME, etc. 3. Bayesian methods
Phylogenetic reconstruction methods
Solving maximum likelihood (and other hard optimization problems) is… unlikely
# of Taxa
# of Unrooted Trees
4 3 5 15 6 105 7 945 8 10395 9 135135 10 2027025 20 2.2 x 1020
100 4.5 x 10190
1000 2.7 x 102900
Quantifying Error
FN: false negative (missing edge)
FP: false positive (incorrect edge)
FP
50% error rate
FN
Neighbor joining has poor performance on large diameter trees [Nakhleh et al. ISMB 2001]
Theorem (Atteson): Exponential sequence length requirement for Neighbor Joining!
NJ
0 400 800 No. Taxa
1600 1200
0.4 0.2
0
0.6
0.8
Erro
r Rat
e
Major Challenges
• Phylogenetic analyses: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements)
Phylogeny Problem
TAGCCCA TAGACTT TGCACAA TGCGCTT AGGGCAT
U V W X Y
U
V W
X
Y
AGAT TAGACTT TGCACAA TGCGCTT AGGGCATGA
U V W X Y
X U
Y
V W
TheRealProblem!
…ACGGTGCAGTTACC-A…
…AC----CAGTCACCTA…
The true multiple alignment – Reflects historical substitution, insertion, and deletion events – Defined using transitive closure of pairwise alignments computed on
edges of the true tree
…ACGGTGCAGTTACCA… Insertion
Substitution Deletion
…ACCAGTCACCTA…
Input: unaligned sequences
S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA
Phase 1: Alignment
S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA
S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA
Phase 2: Construct tree
S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA
S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA
S1
S4
S2
S3
Two-phase estimation Alignment methods • Clustal • POY (and POY*) • Probcons (and Probtree) • Probalign • MAFFT • Muscle • Di-align • T-Coffee • Prank (PNAS 2005, Science 2008) • Opal (ISMB and Bioinf. 2007) • FSA (PLoS Comp. Bio. 2009) • Infernal (Bioinf. 2009) • Etc.
Phylogeny methods • Bayesian MCMC• Maximum
parsimony• Maximum likelihood• Neighbor joining• FastME• UPGMA• Quartet puzzling• Etc.
Simulation Studies
S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA
S1 S2
S4 S3
True tree and alignment
S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA
Compare
S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-C--T-----GACCGC-- S4 = T---C-A-CGACCGA----CA
S1 S4
S2 S3
Estimated tree and alignment
Unaligned Sequences
1000taxonmodels,orderedbydifficulty(Liuetal.,2009)
Multiple Sequence Alignment (MSA): another grand challenge1
S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- … Sn = -------TCAC--GACCGACA
S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC … Sn = TCACGACCGACA
Novel techniques needed for scalability and accuracy NP-hard problems and large datasets Current methods do not provide good accuracy Few methods can analyze even moderately large datasets Many important applications besides phylogenetic estimation
1 Frontiers in Massive Data Analysis, National Academies Press, 2013
Major Challenges
• Phylogenetic analyses: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements)
• Multiple sequence alignment: key step for many biological questions (protein structure and function, phylogenetic estimation), but few methods can run on large datasets. Alignment accuracy is generally poor for large datasets with high rates of evolution.
(Phylogenetic estimation from whole genomes)
Orangutan Gorilla Chimpanzee Human
From the Tree of the Life Website, University of Arizona
Species Tree Estimation requires multiple genes!
Two basic approaches for species tree estimation
• Concatenate (“combine”) sequence alignments for different genes, and run phylogeny estimation methods
• Compute trees on individual genes and
combine gene trees
Usingmul<plegenes
gene 1S1
S2
S3
S4
S7
S8
TCTAATGGAA GCTAAGGGAA TCTAAGGGAA TCTAACGGAA TCTAATGGAC
TATAACGGAA
gene 3TATTGATACA
TCTTGATACC
TAGTGATGCA
CATTCATACC
TAGTGATGCA
S1
S3
S4
S7
S8
gene 2GGTAACCCTCGCTAAACCTC
GGTGACCATC
GCTAAACCTC
S4
S5
S6
S7
Concatena<on gene 1
S1
S2
S3
S4
S5
S6
S7
S8
gene 2 gene 3 TCTAATGGAA GCTAAGGGAA TCTAAGGGAA TCTAACGGAA
TCTAATGGAC
TATAACGGAA
GGTAACCCTCGCTAAACCTC
GGTGACCATC
GCTAAACCTC
TATTGATACA
TCTTGATACC
TAGTGATGCA
CATTCATACC
TAGTGATGCA ? ? ? ? ? ? ? ? ? ?
? ? ? ? ? ? ? ? ? ?
? ? ? ? ? ? ? ? ? ?
? ? ? ? ? ? ? ? ? ?
? ? ? ? ? ? ? ? ? ?
? ? ? ? ? ? ? ? ? ?
? ? ? ? ? ? ? ? ? ?
? ? ? ? ? ? ? ? ? ?
? ? ? ? ? ? ? ? ? ?
Redgenetree≠speciestree(greengenetreeokay)
GeneTreeIncongruence
Genetreescandifferfromthespeciestreedueto:
Duplica<onandlossHorizontalgenetransferIncompletelineagesor<ng(ILS)
IncompleteLineageSor<ng(ILS)
Confoundsphylogene<canalysisformanygroups:
HominidsBirdsYeastAnimalsToadsFishFungi
Thereissubstan<aldebateabouthowtoanalyzephylogenomicdatasetsinthepresenceofILS.
LineageSor<ng
Popula<on-levelprocess,alsocalledthe “Mul<-speciescoalescent”(Kingman,1982)Genetreescandifferfromspeciestreesduetoshort<mesbetweenspecia<oneventsorlargepopula<onsize;thisiscalled“IncompleteLineageSor<ng”or“DeepCoalescence”.
TheCoalescent
Present
Past
CourtesyJamesDegnan
GenetreeinaspeciestreeCourtesyJamesDegnan
Keyobserva<on:Underthemul<-speciescoalescentmodel,thespeciestreedefinesaprobabilitydistribu4ononthegenetrees,andisiden4fiablefromthedistribu4onongenetrees
CourtesyJamesDegnan
. . .
Analyzeseparately
Summary Method
Twocompe<ngapproaches
gene 1 gene 2 . . . gene k
. . . Concatenation
Species
Orangutan Gorilla Chimpanzee Human
From the Tree of the Life Website, University of Arizona
Speciestreees<ma<on:difficult,evenforsmalldatasets!
Major Challenges: large datasets, fragmentary sequences
• Multiple sequence alignment: Few methods can run on large datasets, and alignment accuracy is generally poor for large datasets with high rates of evolution.
• Gene Tree Estimation: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements).
• Species Tree Estimation: gene tree incongruence makes accurate estimation of species tree challenging.
• Phylogenetic Network Estimation: Horizontal gene transfer and hybridization requires non-tree models of evolution
Both phylogenetic estimation and multiple sequence alignment are also
impacted by fragmentary data.
Major Challenges: large datasets, fragmentary sequences
• Multiple sequence alignment: Few methods can run on large datasets, and alignment accuracy is generally poor for large datasets with high rates of evolution.
• Gene Tree Estimation: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements).
• Species Tree Estimation: gene tree incongruence makes accurate estimation of species tree challenging.
• Phylogenetic Network Estimation: Horizontal gene transfer and hybridization requires non-tree models of evolution
Both phylogenetic estimation and multiple sequence alignment are also
impacted by fragmentary data.
AvianPhylogenomicsProject
GZhang,BGI
• Approx.50species,wholegenomes• 14,000loci
MTPGilbert,Copenhagen
S.MirarabMd.S.Bayzid,UT-Aus<nUT-Aus<n
T.WarnowUT-Aus<n
Plusmanymanyotherpeople…
ErichJarvis,HHMI
Challenges:• Speciestreees<ma<onunderthemul<-speciescoalescentmodel,from14,000poores<matedgenetrees,allwithdifferenttopologies(weused“sta<s<calbinning”)• Maximumlikelihoodes<ma<ononamillion-sitegenome-scalealignment–250CPUyears
Science,December2014(Jarvis,Mirarab,etal.,andMirarabetal.)
1kp:ThousandTranscriptomeProject
l PlantTreeofLifebasedontranscriptomesof~1200speciesl Morethan13,000genefamilies(mostnotsinglecopy)l Firstpaper:PNAS2014(~100speciesand~800loci)GeneTreeIncongruence
G. Ka-Shu Wong U Alberta
N. Wickett Northwestern
J. Leebens-Mack U Georgia
N. Matasci iPlant
T. Warnow, S. Mirarab, N. Nguyen, UIUC UT-Austin UT-Austin
Plus many many other people…
UpcomingChallenges(~1200species,~400loci):• Speciestreees<ma<onunderthemul<-speciescoalescentfromhundredsofconflic<nggenetreeson>1000species(wewilluseASTRAL–Mirarabetal.2014,Mirarab&Warnow
2015)• Mul<plesequencealignmentof>100,000sequences(withlotsoffragments!)–wewilluseUPP(Nguyenetal.,2015)
Constructing the Tree of Life: Hard Computational Problems
NP-hardproblemsLargedatasets
100,000+sequencesthousandsofgenes
“Bigdata”complexity:
modelmisspecifica<onfragmentarysequenceserrorsininputdatastreamingdata
Research Strategies
Improvedalgorithmsthrough:• Divide-and-conquer• “Bin-and-conquer”• Iteration• Bayesiansta<s<cs• HiddenMarkovModels• Graphtheory• Combinatorialop<miza<on
Sta<s<calmodellingMassiveSimula<onsHighPerformanceCompu<ng
Evolu<oninformsabouteverythinginbiology
• Biggenomesequencingprojectsjustproducedata------sowhat?
• Evolu<onaryhistoryrelatesallorganismsandgenes,andhelpsusunderstandandpredict– interac<onsbetweengenes(gene<cnetworks)– drugdesign– predic<ngfunc<onsofgenes– influenzavaccinedevelopment– originsandspreadofdisease– originsandmigra<onsofhumans
Metagenomics: Venter et al., Exploring the Sargasso Sea:
Scientists Discover One Million New Genes in Ocean Microbes
Metagenomic data analysis NGS data produce fragmentary sequence data Metagenomic analyses include unknown
species Taxon identification: given short sequences,
identify the species for each fragment Applications: Human Microbiome Issues: accuracy and speed
MihaiPop,UnivMaryland
Metagenomictaxoniden6fica6on
Objec<ve:classifyshortreadsinametagenomicsample
PossibleIndo-Europeantree(Ringe,WarnowandTaylor2000)
Anatolian
Albanian
Tocharian
Greek
Germanic
Armenian
Italic
Celtic
Baltic Slavic
Vedic Iranian
“PerfectPhylogene<cNetwork” forIENakhlehetal.,Language2005
Anatolian
Albanian
Tocharian
Greek
Germanic
Armenian
Baltic Slavic
Vedic Iranian
Italic
Celtic
Grading
• Homework:25%(onehwdropped)• Midterm:40%(March30)• FinalProject:25%(dueMay6)• CoursePar<cipa<on:10%
Nofinalexam.
HomeworkAssignments
• HomeworkassignmentsarelistedathHp://tandy.cs.illinois.edu/cs581-2017-hw.htmlandaredueat1PM(inpersonorviaemail)–latehomeworkshavereducedcreditandwillnotbeacceptedayer48hourspastthedeadline.
• Youareencouragedtoworkwithothersonyourhomework,butyoumustwriteupsolu<onsbyyourselfandindicatewhoyouworkedwithoneachhomework.
CourseSchedule
AdetailedcoursescheduleisathHp://tandy.cs.illinois.edu/cs581-2017-detailed-syllabus.htmlThisscheduleincludesmaterialyouareexpectedtohavelookedatbeforecomingtoclass:• assignedreading(fromtextbookand/or
scien<ficliterature)• PPTand/orPDFofmylecture
FinalProjectandClassPresenta<on
Eitherresearchproject(canbewithanotherstudent)orsurveypaper(donebyyourself).Manyinteres<ngandpublishableproblemstoaddress:seehHp://tandy.cs.illinois.edu/topics.htmlYourclasspresenta<onshouldberelatedtoyourfinalproject.
AcademicIntegrity
PleaseseecoursewebsiteathHp://tandy.cs.illinois.edu/581-2017.htmlandalsohHp://tandy.cs.illinois.edu/ethics.pdfForthiscourse:• Examinethepolicyaboutcollabora<on• Learnandunderstandwhatplagiarismis(and
thendon’tdoit).Thisappliestohomework,allwri<ngassignments,andthefinalproject.
CourseResearchProjects
• Evalua<ngexis<ngmethodsonsimulatedandreal(biologicalorlinguis<c)datasets
• Designinganewmethod,andestablishingitsperformance(usingtheoryanddata)
• Analyzingabiologicaldatasetusingseveraldifferentmethods,toaddressbiology
ExamplesofpublishedcourseprojectsMdS.Bayzid,T.Hunt,andT.Warnow."DiskCoveringMethodsImprovePhylogenomicAnalyses".ProceedingsofRECOMB-CG(Compara<veGenomics),2014,andBMCGenomics2014,15(Suppl6):S7.T.Zimmermann,S.MirarabandT.Warnow."BBCA:Improvingthescalabilityof*BEASTusingrandombinning".ProceedingsofRECOMB-CG(Compara<veGenomics),2014,andBMCGenomics2014,15(Suppl6):S11.J.Chou,A.Gupta,S.Yaduvanshi,R.Davidson,M.Nute,S.MirarabandT.Warnow.“Acompara<vestudyofSVDquartetsandothercoalescent-basedspeciestreees<ma<onmethods”.RECOMB-Compara<veGenomicsandBMCGenomics,2015.,2015,16(Suppl10):S2.P.Vachaspa<andT.Warnow(2016).FastRFS:FastandAccurateRobinson-FouldsSupertreesusingConstrainedExactOp<miza<onBioinforma<cs2016;doi:10.1093/bioinforma<cs/btw600.(SpecialissueforpapersfromRECOMB-CG)
ResearchProjectsyoucouldjoin• Phylogenomicsprojects(Avianandthe1KP)
• Speciestreeandnetworkes<ma<onfromconflic<nggenes• Large-scalemul<plesequencealignment• Large-scalemaximumlikelihoodtreees<ma<on• Improvinggenetreees<ma<onusingwholegenomes
• Metagenomics(withMihaiPop,UniversityofMaryland,andBillGropp)• Iden<fyinggenesandtaxafromshortsequences• Metagenomicassembly• Applica<onstoclinicaldiagnos<cs
• ProteinSequenceAnalysis(withJianPeng)• Whatfunc<onandstructuredoesthisproteinhave?• Howdidstructureandfunc<onevolve?
• HistoricalLinguis<cs(withDonaldRinge,UPenn)• HowdidIndo-Europeanevolve?• Designingandimplemen<ngsta<s<cales<ma<onmethodsfor
languagephylogenies