  • A simple model for a complex world
    Simon Whelan, University of Manchester
    Isaac Newton Institute

  • Modelling sequence evolution

    Seq1 TCTTTATTGACGTGTATGGACAATTC...
    Seq2 TCTTTGTTAACGTGCATGGACAATTC...
    Seq3 TCCTTGCTAACATGCATGGACAATTC...
    Seq4 TCTTTGCTAACGTGCATGGATAATTC...
    Seq5 TCTT---TAACGTGCATAGATAACTC...
    Seq6 TCAC---TAACATGTATAGATAACTC...
    Seq7 TCTCTTCTAACGTGCATTGTGAAGTC...
    Seq8 TCTCTTTTGACATGTATTGAAAAATC...

    Simple models assume:
      All sites evolve according to the same process
      All parts of the tree evolve according to the same process

    [Figure label: AACGTGTC]

  • Spatial heterogeneity in sequence evolution
    Also known as pattern heterogeneity
    [Figure: the Seq1 to Seq8 alignment again, with labels AACGTGTC (twice) and Rate = 0.5, Rate = 1.0, Rate = 2.0]

  • Temporal heterogeneity in sequence evolution
    [Figure: the Seq1 to Seq8 alignment again, with labels AACGTGTC (twice) and Rate = 0.5, Rate = 1.0, Rate = 2.0]

  • Motivation

    Biological
      Not including heterogeneity leads to inaccurate inferences (Naylor; Lockhart)
      Form of heterogeneity is poorly characterised
      Understanding heterogeneity may lead to biological insights

    Modelling
      Needs to describe general heterogeneity (Warnow)
      Must be identifiable (Rhodes; Allman)
      Should be computationally efficient:
        Few parameters
        Small(-ish) number of states
        Applicable to tree search

  • General type of model
      Describes temporal and spatial heterogeneity
      Allows simple likelihood computation (reversible; stationary; i.i.d.)

    Previous incarnations
      Mostly examine temporal and spatial rate variation
      Covarion model of Tuffley and Steel and its progeny
      Other names include:
        Markov modulated Markov processes (models)
        Switching processes
        Covarion-like

    Temporal hidden Markov models (THMMs)

  • Substitution processes
    There are 1, ..., g separate HKY substitution processes, each representing a hidden state in an HMM
    The kth hidden state is defined by rate matrix Mk [equation not recovered], with parameters:
      nucleotide distribution of hidden state k
      rate of hidden state k
      transition/transversion rate ratio of hidden state k
    Note: Subscripts refer to observable states. Superscripts refer to hidden states
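The slide's equation for the rate matrix did not survive extraction, so the following is only a minimal sketch of a standard HKY rate matrix for one hidden state. The names `pi`, `kappa` and `rho` are assumed stand-ins for the nucleotide distribution, the transition/transversion rate ratio and the rate of the hidden state; they are not taken from the slides. The scaling convention (mean rate at equilibrium equals `rho`) is likewise an assumption.

```python
# Sketch of one hidden state's HKY rate matrix (parameter names are assumptions).
NUCLEOTIDES = "ACGT"
PURINES = {"A", "G"}


def is_transition(a, b):
    """True when the change a -> b stays within purines or within pyrimidines."""
    return (a in PURINES) == (b in PURINES)


def hky_rate_matrix(pi, kappa, rho=1.0):
    """Return a 4x4 HKY rate matrix with rows and columns ordered A, C, G, T.

    pi    -- nucleotide distribution of the hidden state (sums to 1)
    kappa -- transition/transversion rate ratio of the hidden state
    rho   -- overall rate of the hidden state
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(NUCLEOTIDES):
        for j, b in enumerate(NUCLEOTIDES):
            if i != j:
                q[i][j] = (kappa if is_transition(a, b) else 1.0) * pi[j]
    # Rescale so the expected substitution rate at equilibrium equals rho.
    mean_rate = sum(pi[i] * q[i][j] for i in range(4) for j in range(4) if i != j)
    for i in range(4):
        for j in range(4):
            q[i][j] *= rho / mean_rate
        # Each row of a rate matrix sums to zero.
        q[i][i] = -sum(q[i][j] for j in range(4) if j != i)
    return q


# Hypothetical parameter values, chosen only for illustration.
pi_example = [0.1, 0.2, 0.3, 0.4]
M_example = hky_rate_matrix(pi_example, kappa=2.0, rho=0.5)
```

With g hidden states, one such matrix would be built per state, each with its own parameter values.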

  • Temporal heterogeneity: a switching model
    A reversible Markov model describing the switching rate between hidden states
    This process is defined by the g x g rate matrix C [equation not recovered], with parameters:
      exchangeability between hidden states k and l
      probability of a hidden state
    Note: Subscripts refer to observable states. Superscripts refer to hidden states
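As an illustration of a reversible switching process, the sketch below builds a g x g rate matrix from symmetric exchangeabilities and hidden-state probabilities, in the same way a GTR matrix is built from exchangeabilities and frequencies. The slide's own formula was lost, so the parameterisation and the names `exch` and `omega` are assumptions.

```python
def switching_matrix(exch, omega):
    """Return a reversible g x g rate matrix for switching between hidden states.

    exch  -- symmetric g x g exchangeabilities (diagonal ignored); assumed name
    omega -- probabilities of the g hidden states (sums to 1); assumed name
    """
    g = len(omega)
    c = [[0.0] * g for _ in range(g)]
    for k in range(g):
        for l in range(g):
            if k != l:
                # Rate of switching k -> l: exchangeability times target probability.
                c[k][l] = exch[k][l] * omega[l]
        # Diagonal is still zero here, so this is minus the off-diagonal row sum.
        c[k][k] = -sum(c[k])
    return c


# Hypothetical values for three hidden states.
omega_example = [0.5, 0.3, 0.2]
exch_example = [[0.0, 0.1, 0.2],
                [0.1, 0.0, 0.4],
                [0.2, 0.4, 0.0]]
C_example = switching_matrix(exch_example, omega_example)
```

Because the exchangeabilities are symmetric, `omega[k] * C[k][l]` equals `omega[l] * C[l][k]`, which is the detailed-balance condition that makes the process reversible with stationary distribution `omega`.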

  • Defining a THMM
    The 4g x 4g instantaneous rate matrix [equation not recovered] gives the rate of changes between observable states i, j and hidden states k, l
    Hidden states and observable states do not change simultaneously
    Equilibrium distribution is [expression not recovered]
    Note: Subscripts refer to observable states. Superscripts refer to hidden states
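The sketch below shows one way a 4g x 4g rate matrix with these properties could be assembled: HKY blocks on the diagonal for changes of nucleotide within a hidden state, and switching entries that change the hidden state while leaving the nucleotide unchanged. The slide's formula was not recovered, so this is an assumed construction rather than the author's definition. In particular, weighting the switching rate by the target state's nucleotide frequency is a choice made here so that the process stays reversible when nucleotide distributions differ between hidden states; all function and parameter names are hypothetical.

```python
def hky(pi, kappa, rho):
    """Off-diagonal HKY rates (order A, C, G, T), scaled to mean rate rho."""
    ts = {(0, 2), (2, 0), (1, 3), (3, 1)}  # A<->G and C<->T are transitions
    q = [[(kappa if (i, j) in ts else 1.0) * pi[j] if i != j else 0.0
          for j in range(4)] for i in range(4)]
    mean = sum(pi[i] * q[i][j] for i in range(4) for j in range(4))
    return [[rho * x / mean for x in row] for row in q]


def thmm_rate_matrix(pis, kappas, rhos, exch, omega):
    """Assemble a 4g x 4g rate matrix; state (k, i) is stored at index 4*k + i."""
    g = len(omega)
    n = 4 * g
    q = [[0.0] * n for _ in range(n)]
    for k in range(g):
        m = hky(pis[k], kappas[k], rhos[k])
        for i in range(4):
            # Nucleotide changes within hidden state k.
            for j in range(4):
                if i != j:
                    q[4 * k + i][4 * k + j] = m[i][j]
            # Hidden-state switches k -> l with the nucleotide i held fixed.
            for l in range(g):
                if l != k:
                    q[4 * k + i][4 * l + i] = exch[k][l] * omega[l] * pis[l][i]
    # Entries that change both i and k stay zero; diagonals make rows sum to zero.
    for r in range(n):
        q[r][r] = -sum(q[r])
    return q


def equilibrium(pis, omega):
    """Joint equilibrium of (hidden state, nucleotide) under this construction."""
    return [omega[k] * pis[k][i] for k in range(len(omega)) for i in range(4)]


# Hypothetical two-state example.
pis_ex = [[0.1, 0.2, 0.3, 0.4], [0.25, 0.25, 0.25, 0.25]]
omega_ex = [0.6, 0.4]
Q_ex = thmm_rate_matrix(pis_ex, [2.0, 4.0], [0.5, 2.0],
                        [[0.0, 0.3], [0.3, 0.0]], omega_ex)
eq_ex = equilibrium(pis_ex, omega_ex)
```

Under these assumptions the joint equilibrium is the product of the hidden-state probability and that state's nucleotide frequency, which can be checked numerically by confirming that the equilibrium vector multiplied by the rate matrix is zero.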

  • THMMs for spatial and temporal heterogeneity
    [Figure: bubble plots over nucleotide labels (ACGT and AGTC orderings, three each); value proportional to bubble area]
    Rate of transitions between hidden states relative to substitution rate: 0.07
    C = [matrix not recovered]

  • Mixture models for spatial heterogeneity
    [Figure: plots over nucleotide labels (ACGT and AGTC orderings) for State 1, State 2 and State 3]
    Probability of different hidden states accounted for by the equilibrium distribution at the root
    Restricting all switching rates to zero results in a mixture model

  • Mixture models for spatial heterogeneity
    [Figure: three Pr( ) terms whose arguments were not recovered]
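The arguments of the Pr( ) terms were lost, so the sketch below assumes the usual mixture-model form: the likelihood of a site is the sum over hidden states of the state's probability multiplied by the site likelihood computed under that state alone. The per-state site likelihoods are taken as given inputs (in practice they would come from a tree likelihood calculation), and all names and numbers are illustrative.

```python
import math


def mixture_site_likelihood(state_likelihoods, state_probs):
    """Site likelihood under a mixture: sum_k Pr(state k) * L(site | state k)."""
    return sum(p * l for p, l in zip(state_probs, state_likelihoods))


def mixture_log_likelihood(per_site_state_likelihoods, state_probs):
    """Total log-likelihood, assuming sites are independent."""
    return sum(math.log(mixture_site_likelihood(site, state_probs))
               for site in per_site_state_likelihoods)


# Hypothetical per-state likelihoods for two sites under three hidden states.
probs_example = [0.5, 0.3, 0.2]
sites_example = [[0.2, 0.1, 0.4],
                 [0.05, 0.3, 0.1]]
site0 = mixture_site_likelihood(sites_example[0], probs_example)
total_lnl = mixture_log_likelihood(sites_example, probs_example)
```

In this form each site is assigned to one hidden state for the whole tree, which is why a mixture captures spatial but not temporal heterogeneity.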

  • Investigating heterogeneity in groEL

    Data
      Herbeck et al. (2005) examined groEL sequences to investigate origins of primary endosymbionts
      Variability of GC content demonstrated to affect tree estimate
      There are 23 sequences of length 1572 nucleotides (all 3 codon positions)

  • Investigating heterogeneity in groEL

    Investigating spatial heterogeneity
      Use mixture model (switching rates set to 0)
      Examine 2 and 3 hidden states
      Relative importance of letting rate, nucleotide frequencies, or Ts/Tv bias vary between hidden states
      Importance of all HKY parameters varying between classes

    Investigating simple temporal heterogeneity
      Single extra degree of freedom over mixture models
      Use simple THMM (switching exchangeabilities set to equal)
      Relative importance of allowing different HKY parameters to vary temporally

    Investigating simple temporal heterogeneity
      GTR switching allows all switching exchangeabilities to vary

  • Results: groEL (no Γ-distribution)
    lnL(HKY) = -18579.6
    lnL(HKY+dG) = -16209.9
    [Table not recovered: improvement in lnL over HKY, with improvement in AIC over HKY in parentheses]
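As a worked example of the two quantities reported, the sketch below computes the improvement in lnL and in AIC for the two log-likelihoods that did survive on this slide. It uses the standard definition AIC = 2k - 2 lnL and assumes that HKY+dG adds exactly one free parameter (a gamma shape parameter) to HKY; that parameter count is an assumption, since the slides do not state it.

```python
def aic(lnl, n_params):
    """Akaike information criterion: 2k - 2 lnL (lower is better)."""
    return 2.0 * n_params - 2.0 * lnl


def aic_improvement(lnl_base, lnl_new, extra_params):
    """Drop in AIC when moving from the base model to one with extra parameters."""
    return 2.0 * (lnl_new - lnl_base) - 2.0 * extra_params


lnl_hky = -18579.6       # from the slide
lnl_hky_dg = -16209.9    # from the slide

delta_lnl = lnl_hky_dg - lnl_hky                              # about 2369.7
delta_aic = aic_improvement(lnl_hky, lnl_hky_dg, extra_params=1)  # about 4737.4
```

The parameters shared by both models (branch lengths, frequencies, Ts/Tv ratio) cancel in the difference, so only the number of extra parameters matters for the improvement.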

  • It's all about rate: frequencies
    [Figure: plots over nucleotide labels (ACGT and AGTC orderings, three each)]
    Rate of transitions between hidden states relative to substitution rate: 0.04
    C = [matrix not recovered]

  • Results: groEL (with Γ-distribution)
    lnL(HKY) = -18579.6
    lnL(HKY+dG) = -16209.9
    [Table not recovered: improvement in lnL over HKY+Γ, with improvement in AIC over HKY+Γ in parentheses]

  • THMM+Γ Frequencies
    [Figure: plots over nucleotide labels (ACGT and AGTC orderings) for State 1, State 2 and State 3]
    Rate of transitions between hidden states relative to substitution rate: 0.14
    C = [matrix not recovered]

  • THMM+Γ All+H
    [Figure: plots over nucleotide labels (ACGT and AGTC orderings) for State 1, State 2 and State 3]
    Rate of transitions between hidden states relative to substitution rate: 0.07
    C = [matrix not recovered]

  • More results: data from PANDIT
    lnL(HKY) = -1 053 026.8
    lnL(HKY+dG) = -1 017 588.4
    Improvement = 35 438.4
    [Table not recovered: improvement in lnL over HKY+Γ, with improvement in AIC over HKY+Γ in parentheses]

  • More evolution = more heterogeneity?
    All+Γ with GTR switching
    (Looks similar for dN/dS and dN)
    [Figure: plot not recovered; numeric labels garbled in extraction]

  • More evolution = more heterogeneity?

    Potential cause 1: Something wrong with the statistics
      AIC per site relative to HKY(+Γ) is not correcting properly for improvements given by additional branches or something else
      Some kind of systematic error as tree length grows, such as tree estimate accuracy

    Potential cause 2: Something biologically interesting
      As tree length grows the substitution process tends to appear more heterogeneous

  • [Figure: sequence TTCGTA shown against a Time axis]

  • A simple model for describing complexity

    Degeneracy of the genetic code
      The degeneracy of the genetic code can lead to staccato patterns of evolution, particularly at the 3rd codon position
      Present in nearly all analyses of nucleotide coding data
      [Figure labels: 4-fold degeneracy; 2-fold degeneracy; 1-fold degeneracy; Hidden biological process]

    Other types of complexity
      Any sequence where biological function places restrictions on how sites change and those restrictions have the potential to vary over time

  • Conclusions

    Temporal and spatial heterogeneity
      Spatial variation in rate masks other effects
      Most complex model provides best description of data in all cases
      Progression to 4 hidden state models provides further improvement, but runs into numerical optimisation problems

    Biological causes of heterogeneity
      May occur whenever there is biological function in sequence data
      Long evolutionary times may require (even) more sophisticated models
      THMMs could provide a simple framework for describing and drawing inferences from heterogeneity induced by complex dependencies

    Note: The figures are the increase in likelihood over HKY and the increase in AIC over HKY