inference of demographic histories of natural populations...
Post on 18-Jul-2020
3 Views
Preview:
TRANSCRIPT
Inferenceofdemographichistoriesofnaturalpopulationsusingsequence
data
Coalescence,MutationandRecombination
RichardDurbinrd109@cam.ac.ukCeskyKrumlov24/1/18
WhatImeanbydemography
1. Populationsizegoingbackintime– Actually“effectivepopulationsize”Ne(t)
• Wewillcomebacktowhatthismeans– Approximatetimerange10k–1Myearsago
• Againwewillseewhy2. Populationstructure
– Subpopulationsandwhentheysplit(andmerged?)
• Basedonexplicitevolutionarymodels– Relatepatternsof(shared)geneticvariation
accumulatedsinceacommonancestortohistory
Treeontwosequences• GustaveMalécot(1940s)
• Coalescenceisjoiningtogether,inourcasegoingbackwardsintime
• Chanceofcoalescencepergenerationis1/N• TMRCAisexponentiallydistributedwithmeanN
Probabilityofobservingamutation
• Toseeamutation,itmusthavehappenedononeofthebranchessincethecommonancestor
• P(observedmutation)=2Tµ• E(observeddifferencerate)=θπ =2Nµ• Humansarediploid,soθ =4Neµ, whereNeistheeffectivepopulationsize
• Forhumans,θπ =~0.001– 1/800–1/1200dependingonpopulation
• HardtomeasureNeandµ independently…
Effectivepopulationsize• Lotsofmystique/angstaboutthis
– Ourdefinitionisarguablyatthecoreoftheconcept• thereciprocaloftheprobabilityofsharingaparentinthepreviousgeneration
• =1/coalescencerate• Whythisisdifferentfromcensuspopulationsize:
– Manyconsequencesoccuroverlargenumbers(oftenorderofNegenerations)–longtermaveraging
– Structuregeneratesnon-randompatternsofcoalescence,andnon-independencebetweengenerations
– Maybeonlyasmallpercentageofindividualsbreed– Selectionfavourssomeindividualsoverothers
• Butisalwayssomethingofthisformthatwegetatbypopulationgeneticanalysis
SegmentsoffixedTMRCAareseparatedbyrecombination
Past
Mutations
Recombination in some ancestor
PairwiseSequentiallyMarkovianCoalescent
LiandDurbin(2010):Inferenceofhumanpopulationhistoryfromindividualgenomesequences
Hidden Markov Model
PSMCHiddenMarkovModel
• Movefromlefttorightinthegenome– LetP(x|t)=prob(datauptox|TMRCAatx = t)– CalculateP(x+1|t)=(∑sP(x|s) r(t|s))e(x)
• e(x)=“emissionatx”=2𝜇t ifahet,else(1-2𝜇t)• r(t|s)=prob(recombinationfromTMRCAstot)=2𝜌s prob(coalescebacktot)
+ (1-2𝜌s) ift = s
DependsonN(t’) t’ < s,t
Markovassumption
• Thismodelassumesthatdatatotheleftofx|TMRCAatx = t
isindependentofdatatotherightofx|TMRCAatx = t
• Forstandardmixingpopulationsthisisaverygoodassumption– SequentiallyMarkovianCoalescentapproximation,McVean&Cardin2005
PSMC-HMMreconstructsindividualhistory
• PairwiseSequentiallyMarkovianCoalescent–HiddenMarkovModel
• Datasimulatedusingms(Hudson)• Modelthecoalescenttimetbye.g.50discretebins,spreadlogarithmically
Singlehumangenomewithbootstrap
Humanpopulationhistory,withNeanderthals
HengLiScaled coalescent time
AdvancessincetheoriginalPSMC
1. UseSMC’modelwhichcorrectlyhandlesrecombinationscoalescingbacktothesameancestor(Schiffels,…)
– Minortweaktoequations,butsignificant– Cannowfitrecombination:mutationratio– ImplementedinMSMC/MSMC2
2. Timespeedup:linearnotquadraticinnumberoftimeslices(Harris,...Song,2014)
CoalesecentNe(t)reflectsancestralstructureaswellaspopulationsize
• PSMCactuallymeasuresλ = 1/coalescence rate
• Structurecanalsochangecoalescencerate– Li&Durbinsupplement– OlivierMazet…Chikhi
N-island model Migration between islands controls coalescent rate
Humanpopulationhistory,withNeanderthals
rise of anatomicalmodern humans 1-200kya
origin of Homo1.5-2MyaHengLi
DramaticrecentradiationsofhaplochromimecichlidsintheAfricanriftvalleygreatlakes
• LakeMalawi~500specieswithinlast1Myears
Sofarwehavesequenced~80speciesat15-20xcoverage
Briefintroductiontoanothersystem
LakeMalawicichlidPSMCAlticorpusAulonocaraBuccochromisCopadichromisLethrinopsMylochromisNimbochromisOtopharynxPlacidochromisTremitochranus
popu
latio
n siz
e
104
105
106
107
108
time [years]104 105 106
(Very) approximate 4x 4x 4x
Isstructureassociatedwithspeciation?
• Perhaps– Ideasofhybridspeciation,reuseofallelesselectedindifferentenvironments,hybridswarmsandgeneflow
• Butthisisanothertalk…
Mightstructurebe(partly)identifiableinthePSMCmodel?
• TheinferredvaluesN(t)havedimensionT,thenumberoftimebins
• ButthetransitionmatrixMhasdimensionT2• CurrentlywederiveMfromNbytheoryassumingpanmixia– Istherearichertheoryforstructuredpopulations?– HowtoparameterisestructuralcomplexityS(t)attimet,withassociatedtheoryforM(N,S)
• OrcanwefitthetransitionmatrixMunconstrained?– Thensearchforevidenceofstructurewithinit– Andordogoodnessoffit?
Addinganothersequence
• Chanceofcoalescencepergenerationfromthreesequencesis3/N
• Oncewehaveacoalescencewearebacktothesituationwithtwosequences
• Fromisequenceschanceisi(i-1)/2N
Digression:Coalescentmodel(Kingman,1980)Adistributionontrees
• T(i)~exponentialwithmean2N/i(i-1)
Propertiesofthecoalescent
• Asweaddextrasequences,theyareincreasinglylikelytocoalesceveryfast,andincreasinglyunlikelytoaffectthefullTMRCA
• Treesareveryvariable– E.g.4samples
on6leaves
Theexpectedheightofthetreeformanysamplesisonlytwicethatwithtwosamples
Relationshipbetweenforwardsintime(Wright-Fisher)andbackwardsintime(Coalescent)models
Populationevolutionforwards
Coalescenttreebackwards
ThecoalescenttreedescribesasamplefromtheforwardprocessKingmancoalescentgeneratesan“exact”samplefromWright-Fisher
Geneticvariationinasample
• Mutationsoccuratrandomonthetree– Separationofsourcesofrandomness
• Randomdemographytreestructurefromcoalescent• Randomsamplingofmutationsonthetree
Watterson’stheta
LetSbethenumberofmutations=segregatingsites
E(S)~θlogn
• Densityofmutationswithfrequencyiinasampleofnisθ/i
• 1/fdistributionofpopulationallelefrequencies
• Populationminorallelefrequencydistributionofadifferenceobservedbetweentwosequencesisflat– Probability(1/f).2f(1-f)=2(1-f),foldedat½is2
Distributionofvariantallelefrequencies
sitefrequencyspectrumSFS
Relaxationofassumptions(1)
• E.g.changeinpopulationsizechangesthesitefrequencyspectrum• Tajima’sD
– Sensitivetonumberofraremutations,sochangeinNe
– IfDispositivethereisadeficiencyofraremutations• Excessrecentcoalescences,recentsmallNe-selection
D>0 D<0
ThisisthebasisofSFS-baseddemographyinference
Fig. 2 The expected site frequency spectrum (SFS) of the derived allele (the new mutation arisen in the population) for three different demographic models: (i) a population that has been of constant size throughout history; (ii) a model previously fit to the derived allele
frequency spectrum of Europeans, which includes an out-of-Africa population bottleneck and a second, more recent, population bottleneck (21); and (iii) the same two-bottleneck model of
European history with the addition of recent exponential growth from a population size of 10,000 at the advent of agriculture to an extant effective population size of 10,000,000, which
amounts to 1.7% growth per generation during the last 400 generations.
A Keinan, A G Clark Science 2012;336:740-743
Individualsinhumanoutbredpopulationsstillcarrymanyvariantsnotinthelargesequencedatasets(1000Genomesetc.)
• Exponentialpopulationgrowthinlast10,000yearsgiveslongtipstothetree
• In“big”populations,tipsarehundredsofgenerationslong,sotensofthousandsofprivatevariantspersample,hundredsfunctional
Thisbehaviourisverydependentonpopulationstructure.
Ingeneticisolatestherecenteffectivepopulationsizeissmaller,andthetipsareshorter
Whataboutrecombination?• Ifpointsonthegenomeareveryclose,e.g.adjacent,theysharethesametree
• Ifpointsareveryfar,theirtreesaresampledfromthecoalescentindependently
• Whathappensinbetween?
• Arecombinationintheancestorofamodernsequencemadeitoutoftwoseparatesequences,onecontributingtotheleftandonetotheright
Recombinationchangesthetreeasyoumovealongthesequence
Typicallyrecombinationrateiscomparabletoorlargerthanthemutationrate:both~10-8/bp/geninhumanSo“genetree”varieseverysiteinmixingpopulations
AncestralRecombinationGraph(ARG)
• TheAncestralRecombinationGraphdescribesthewaythatindividualsequencesinapopulationarerelated– Atalocus,sequencesarerelatedbyatree– Ancestralrecombinationschangethetreeasyoumovealongthechromosome
a ..C..G..A..b ..T..G..C..c ..T..A..A..d ..T..A..C..
0 0 01 0 11 1 01 1 1
a b c d a b c da b c d
1
a b c d
2 3
R
ARG “Pruneandgraft”operationgoinglefttoright
Coalescentwithrecombination
• ARGisastructure(datatype)• TheprobabilitydistributionoverARGsthatariseswhenrecombinationisaddedtothestandard(Wright-Fisher)modeliscalledtheCoalescentwithRecombination– Hudson’smssoftwareistheclassicsimulator– NewmsprimefromJeromeKelleherMUCHfaster
• Nowtwopossibleeventsgoingbackwardsintime– Coalescence:whichmergestwosequences
• Forisequences,rateisi(i-1)/2N– Recombination:whichsplitsasequenceintotwo
• Forisequences,rateisiLρ
Extendingtomultiplesequences
• Therecenttimelimitof~20kyaforPSMCissetbecausewerunoutofrecentcoalescencesbetweentwohaplotypes
• Ifweaddmorehaplotypes,thentherearemorerecentcoalescencesandwecouldlookatmorerecenthistory
• But,…thehiddenstateisthenatree(withbranchlengths):impracticaltomodelfully– MCMCisnotoriouslydifficult
Option1:Firstcoalescenceofonesequencetothetreeoftheothers
• ThisisrelatedtotheLiandStephensmodel(orStephensandDonnelly)–chromopainter
t1
t2
t3
t4
t1
t2
t3
t4
• M(t)isarandomvariable,andweneedtheentirehistoryofM(t)tocalculatetransitionprobabilitiesq
• Hugeincreaseinstatespaceand/orthisbreaksMarkovassumptions
t1
t2
t3
t4
Problem:Coalescenceofchosensequencetotheothersdependsonthenumberoflineages
M(t)remainingattimet
MCMCapproach:ARGweaver• Repeatedlyremoveasequence*andadditback,samplingconditionalonremainingARG
• HMM:samplewithforward-backwardalgorithm
Genome-wideinferenceofancestralrecombinationgraphsRasmussenMD,HubiszMJ,GronauI,SiepelA.PLoSGenet.10:e1004342(2014)
• Costly–useforinferencegivenhistory
Option2:firstcoalescencebetweenanypair
• Thisremains(approximately)Markov• StatespaceisO(M2T)–pairofstatesandtimetheycoalesce– ButtransitionupdatesareonlyO(M2T2),becausetransitionsarememoryless
• EmissionsfromXijaresingletonsoniorj– Non-singletonsthatarediscrepantbetweeniandjwipeoutdensityatXij
MSMC
StephanSchiffelsandDurbin(NatureGenetics,2015)
MSMCcanfitbothpopulationsizehistoryandseparationhistory
• Separationviathe(scaled)ratioofcoalescencebetweenandwithinpopulations
Accessmorerecenthistory
Uselowermutationratehere~0.5x10-9/year
Divergencebetweenpopulations
FirstCoalescencewithinPopulation2
FirstCoalescencewithinPopulation1
FirstCoalescenceacrossbothpopulations
• MSMCcaninferseparatecoalescencerateswithinandbetweenpopulations
• Givenrateswithinpopulations,λ11(t)andλ22(t),andacrosspopulations,λ12(t),computerelativegeneflowasratio
λ12(t) [λ11(t)+ λ22(t)] / 2
m(t) =
• Idea:Inferseparatecoalescencerateswithinandbetweenpopulations:
Testinggeneflowinferencewithsimulatedsplit
☜m(t)=1:perfectlymixed
☜m(t)=0:perfectlysplit
4haplotypes:goodforsplits50-200kya.8haplotypes:goodforsplits5-50kya.
Separationhistory
AlternativestoMSMC
• MSMC2(Schiffels:inMalaspinas2016/unpub.)– RunPSMC’onallpairsofsequencesindependently– Multiplythelikelihoods–Compositelikelihood
• Assumesthepairsareindependent,whichisfalse• Butgivesunbiasedestimation(thoughoverconfident)
• SMC++(Terhorst,Kamm,Song:NatGen2017)– Pair,withp(het|othersequences)– Verycool–worksevenongenotypedata!– ButthereareapproximationproblemsanalogoustothoseinMSMC–notapanacea
Usingrarevariantstoinferdemographichistory
• Rarevariantscontaininformationaboutrecentpopulationhistoryandstructure
• Shownhere:numberofdoubletonssharedamongEuropeansamples
• Wewouldliketoestimatepopulationsplittimesandpopulationsizesfromthefrequencyofrarevariants
CEU FIN GBR IBS TSI
[1000GenomesProject,Phase3]
ComparetoChromoPainterdata
AncientsamplesfromHinxton
12884A,Ironage
12883A,Saxon
12880A,Ironage
12881A,Saxon12885A,Saxon
MoresamplesfromLinton/Oakington
Sharingpatternsbetweenancientandmodernsamples
• DifferencebetweenAnglo-SaxonandIronAgesharingwithNEDandIBSconsistentacrossdifferentAlleleCounts
• SmallbutsignificantdifferencesalsowithinmodernBritain(UK10K):SamplesfromWalesandScotlandsharefewerrarevariantswithDutchpeople
EstimatesofAnglo-SaxoncontributiontomodernBritishgenomes
• Suggests~30%SaxoncontributiontosamplesinEastofEngland,and~20%toUK10KsamplesfromWalesandScotland
• Consistentwith20-40%indirectestimatefromPOBI(PeoplesoftheBritishIsles)study
Therareallelecoalescent• Goal:Estimatedemographichistory(populationsizesandsplittimes)fromrarevariants
• Computelikelihoodofdemographicmodelgivenadistributionofrarevariants
RareCoalmodel
• Idea:Definerecursionequationsforprobabilityofobservingiderivedallelesinpopulationk:
• Givenademographicmodel,propagatethisprobabilitybackwardsintimetogetfulllikelihoodofthedata.
• Keysimplification:Treatnumberofancestralallelesovertimeasaverage(mean-fieldapproximation):
Testinferencewithsimulateddata
100
200
300
400
Time(generations)
10 5 2 4 5
0.4
0.5
0.2
1scaledpopulationsizes
92
200
299
388
10.3 5 2 4 5
0.47
0.5
0.22
0.87
Simulated Estimated
Fits(100samplesperpop.):Pattern Count(real) Count(predicted)
0,1,0,2,1 1114 1159
2,1,0,0,0 140585 139657
1,0,2,0,0 1138 1205
thousandsofrows…
Fittingpopulationsizesandsplittimesseparatesdriftfromdivergence->differentfromTreemix,qpGraphetc.
EuropeanTree(Fits)
Placingancientsamplesonthetree
• PlotsshowthelikelihoodformergingthepopulationN=1sampleontothetreeasaheatmap
Moredirectcalculationofthelikelihoodofthejointsitefrequency
spectrumwithmomi
JackKamm…Song2016,andunpublished
• ComplexityofancestralallelestateisreducedbyusingMoranmodel
• UseAutomaticDifferentiationtocalculategradientstomaximiselikelihoodoverdemographywith(limited)geneflow
MomiappliedtocentralAsiandata
• Includeancientsamples– Conditionascertainmentonmodern/deepsamples
• Totalbranchlengthonthese
– Randomallelesamplingforlowcoveragesamples
• Estimatessplittimes• Bootstrapforconfidenceintervals– Butbewaremodelmisspecification
withPeterdeBarrosDamgaard,RuiMartiniano,MartinSikora,EskeWillerslevetal.
Momicalculations
• TocalculateP(x1,x2,x3,…)– SetleavestoIndicator(xi),e.g.[0,0,1,0…0] forxi=2– Propagatelikelihoodsuptree(“tree-peeling”)
• Cancorrespondinglycalculatetheexpectationofanymulti-linearfunctionofallelecounts– 𝔼[f1(x1)f2(x2)f3(x3)…]
• bysettingleafito[fi(0), fi(1), . . . , fi(n1)]– Worksbecausepropagationislinear
Examples
• Totalbranchlength∝chanceofanymutation– fi(j) = 1, vectoris[1,1,1…1]
• TMRCAforpopi(i arbitraryunlessancientmodel)– fi(j) = j/ni , vectoris[0,0.2,0.4,0.6,0.8,1] for ni=5– fk(j) = 1, k ≠ i
• f3 = 𝔼[(X1-X3)(X2-X3)], f4 = 𝔼[(X1-X2)(X3-X4)]– Requirestermssuchas𝔼[X1X2]forwhich– f1(j) = j/n1, f2(j) = j/n2, fk(j) = 1, k > 2
• Alsonumerators,denominatorsofFST,Tajima’sD
Summary
• PSMC(‘)estimatesdemographyfromasinglepairofsequences– Samplesizeisinlengthnotnumber– Quiteacleanmodel– Majorissueispopulationstructure
• MSMC,MSMC2,SMC++useadditionalsamplestogetatmorerecenttimes
• RareCoal/MomiusecoalescentmodellingoftheSFSonmoresamplestoestimatetrees– WithlimitedmodelledgeneflowforMomi
Experimentaldesign
• (Sequence)datacollectioncostsmoney• Wealwaysneedtomakedecisionsinhowtosampleandsequence– Numberofsamples– Numberofpopulations– Depthofsequencing– WholeGenomeShotgunorRADseqorExomes…
1000GenomesProject
• Pilot(averylongtimeago!)– 2triosathighdepth30x
• Phasing,accuratesingle-samplegenotypecalling,mutationrates
– 3populationsx60samplesatlowdepth2-4x+exomes
• Mainproject– 26populationsof~100(2504total)at6-8x(+exomes)– (150triosathighdepth–butwhoremembersthem?)
Malawicichlidsequencing
• Phase1– Threetriosat30x:mutationrateestimation,controls– ~70speciesat15-20x,additionalsamplesforsomeat8-12x
• Phase2– 7setsof20at15x– Morespecies– Somesetsof24or48toaddressspecificquestions
• MassokoGWAS(Turner)– 200samplesat4x+100samplesforreplication
Lowcoveragesequencingstrategy
• Typicallyoneneedstosequenceat~30xdepthtofind(almost)allvariantsinasample
• Tofindlowfrequencyvariantswewanttosequencemanysamples
• Spreadsequenceacrossmoresamples
Phase1powerandgenotypingaccuracy
SNPdetection Genotypingaccuracy
HyunMin-Kang(UMichigan)
Callingfromlowcoveragesequence
• Multi-samplecallsiteswithsamtoolsorGATK• Obtaingenotypelikelihoodsateachsiteineachsame(alsosamtoolsorGATK)– Likelihood=P(data|genotype)
• CombineinanimputationframeworkusingBEAGLE(Browning),orMINIMAC(Abecasis),orperhapsSTITCH(Mott)?
• PhaseusingSHAPEIT2(Marchini)orEAGLE2(Loh)
Sequencingdepth• 30xisstandardfornear-completeaccuracy
– Sufficienttoestimatemutationratesintrios(needseveraltriosformostspecies)
• 15xisgoodenoughforSNPs(~97%),notquitesogoodforindels(perhaps90-95%)
• 4-8xgivesgoodlowcoverageimputationasinpreviousslides
• Peoplehaveused1-2x,butthisishardwork…• 60x+isnecessaryforsubclonalstructure,e.g.cancer,highploidy
• Inacross,sequencethefounderstohighdepth,andtheF2/F3tolowdepth(1xorlessisfine)andimputeusingSTITCHorotherRichardMotttools
top related