Inference of demographic histories of natural populations using sequence

Coalescence, Mutation and Recombination

Richard Durbin, rd109@cam.ac.uk, Cesky Krumlov, 24/1/18

What I mean by demography

1.  Population size going back in time
   –  Actually "effective population size" Ne(t)
      •  We will come back to what this means
   –  Approximate time range 10k–1M years ago
      •  Again we will see why
2.  Population structure
   –  Subpopulations and when they split (and merged?)

•  Based on explicit evolutionary models
   –  Relate patterns of (shared) genetic variation accumulated since a common ancestor to history

Tree on two sequences

•  Gustave Malécot (1940s)
•  Coalescence is joining together, in our case going backwards in time
•  Chance of coalescence per generation is 1/N
•  TMRCA is exponentially distributed with mean N
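The two bullets above can be checked with a small simulation. This is only a sketch: it assumes a haploid Wright-Fisher population of N gene copies in which two lineages pick the same parent with probability 1/N in each generation; the function names and the choice N = 100 are illustrative, not from the talk.

```python
import random

def simulate_tmrca(n_pop, rng):
    """Generations back until two lineages share a parent (chance 1/N per generation)."""
    generations = 1
    while rng.random() >= 1.0 / n_pop:
        generations += 1
    return generations

def mean_tmrca(n_pop, replicates=20000, seed=1):
    """Average TMRCA over many independent replicates."""
    rng = random.Random(seed)
    total = sum(simulate_tmrca(n_pop, rng) for _ in range(replicates))
    return total / replicates

# The simulated mean should be close to N (here 100 generations).
print(mean_tmrca(100))
```

The waiting time here is geometric, which for large N is well approximated by the exponential distribution quoted on the slide.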

Probability of observing a mutation

•  To see a mutation, it must have happened on one of the branches since the common ancestor
•  P(observed mutation) = 2Tµ
•  E(observed difference rate) = θπ = 2Nµ
•  Humans are diploid (N = 2Ne gene copies), so θ = 4Neµ, where Ne is the effective population size
•  For humans, θπ ≈ 0.001
   –  1/800–1/1200 depending on population
•  Hard to measure Ne and µ independently…
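As a worked example of θ = 4Neµ, the sketch below turns a diversity value into an effective size. The mutation rate of 1.25 x 10^-8 per bp per generation is an assumed illustrative value, not one given on this slide; a different µ rescales Ne proportionally, which is the point of the last bullet.

```python
def effective_size_from_theta(theta, mu):
    """Solve theta = 4 * Ne * mu for Ne (diploid population)."""
    return theta / (4.0 * mu)

ASSUMED_MU = 1.25e-8  # per bp per generation; an assumption for illustration only

# Diversity values from the slide: ~0.001, and the 1/800 to 1/1200 range.
for theta in (1 / 800, 0.001, 1 / 1200):
    print(theta, effective_size_from_theta(theta, ASSUMED_MU))
```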

Effective population size

•  Lots of mystique/angst about this
   –  Our definition is arguably at the core of the concept
      •  the reciprocal of the probability of sharing a parent in the previous generation
      •  = 1/coalescence rate
•  Why this is different from census population size:
   –  Many consequences occur over large numbers (often order of Ne generations) – long-term averaging
   –  Structure generates non-random patterns of coalescence, and non-independence between generations
   –  Maybe only a small percentage of individuals breed
   –  Selection favours some individuals over others
•  But it is always something of this form that we get at by population genetic analysis

Segments of fixed TMRCA are separated by recombination

[Figure labels: Mutations; Recombination in some ancestor]

Pairwise Sequentially Markovian Coalescent

Li and Durbin (2010): Inference of human population history from individual genome sequences

Hidden Markov Model

PSMC Hidden Markov Model

•  e(x) = "emission at x" = 2µt if a het, else (1 − 2µt)
•  r(t|s) = prob(recombination from TMRCA s to t) = 2ρs · prob(coalesce back to t) + (1 − 2ρs) if t = s
   –  prob(coalesce back to t) depends on N(t′) for t′ < s, t

Markov assumption

•  This model assumes that data to the left of x | TMRCA at x = t is independent of data to the right of x | TMRCA at x = t
•  For standard mixing populations this is a very good assumption
   –  Sequentially Markovian Coalescent approximation, McVean & Cardin 2005

PSMC-HMM reconstructs individual history

•  Pairwise Sequentially Markovian Coalescent – Hidden Markov Model
•  Data simulated using ms (Hudson)
•  Model the coalescent time t by e.g. 50 discrete bins, spread logarithmically
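One way to picture the discretised hidden state is sketched below: bin boundaries that widen exponentially with time, and the emission probabilities e(x) evaluated at a given TMRCA. The particular spacing formula, its `scale` parameter and the function names are assumptions for illustration; they are not claimed to be the exact scheme used in the PSMC software.

```python
import math

def log_spaced_boundaries(n_bins, t_max, scale=0.1):
    """Boundaries t_0 = 0 < t_1 < ... < t_n = t_max, with widths growing exponentially.

    Uses t_i = scale * (exp(i/n * log(1 + t_max/scale)) - 1); an assumed spacing.
    """
    growth = math.log1p(t_max / scale)
    return [scale * math.expm1(i / n_bins * growth) for i in range(n_bins + 1)]

def emission_probs(t, mu):
    """Emission at one site given TMRCA t: P(het) = 2*mu*t, P(hom) = 1 - 2*mu*t."""
    p_het = 2.0 * mu * t
    if not 0.0 <= p_het <= 1.0:
        raise ValueError("2*mu*t must lie in [0, 1]")
    return {"het": p_het, "hom": 1.0 - p_het}

# 50 bins up to t_max = 15 in scaled coalescent units (t_max is illustrative).
bounds = log_spaced_boundaries(50, 15.0)
print(bounds[:3], bounds[-1])
print(emission_probs(t=1000, mu=1.25e-8))  # t in generations, mu per bp per generation
```

Narrow recent bins and wide ancient bins match where the data carry resolution: recent coalescences are rare but informative, ancient ones are spread over long times.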

Single human genome with bootstrap

Human population history, with Neanderthals

[Figure credit: Heng Li; x-axis: Scaled coalescent time]

Advances since the original PSMC

1.  Use SMC' model which correctly handles recombinations coalescing back to the same ancestor (Schiffels, …)
   –  Minor tweak to equations, but significant
   –  Can now fit recombination:mutation ratio
   –  Implemented in MSMC/MSMC2
2.  Time speed-up: linear not quadratic in number of time slices (Harris, ... Song, 2014)

Coalescent Ne(t) reflects ancestral structure as well as population size

•  PSMC actually measures λ = 1/coalescence rate
•  Structure can also change coalescence rate
   –  Li & Durbin supplement
   –  Olivier Mazet … Chikhi

[Figure: N-island model. Migration between islands controls coalescent rate]

Human population history, with Neanderthals

[Figure annotations: rise of anatomically modern humans 1-200 kya; origin of Homo 1.5-2 Mya. Credit: Heng Li]

Dramatic recent radiations of haplochromine cichlids in the African rift valley great lakes

•  Lake Malawi ~500 species within last 1M years

So far we have sequenced ~80 species at 15-20x coverage

Brief introduction to another system

Lake Malawi cichlid PSMC

[Figure: PSMC curves for Alticorpus, Aulonocara, Buccochromis, Copadichromis, Lethrinops, Mylochromis, Nimbochromis, Otopharynx, Placidochromis, Trematocranus. x-axis: time [years], 10^4 to 10^6. Annotations: (very) approximate 4x, 4x, 4x]

Is structure associated with speciation?

•  Perhaps
   –  Ideas of hybrid speciation, reuse of alleles selected in different environments, hybrid swarms and gene flow
•  But this is another talk…

Might structure be (partly) identifiable in the PSMC model?

•  The inferred values N(t) have dimension T, the number of time bins
•  But the transition matrix M has dimension T²
•  Currently we derive M from N by theory assuming panmixia
   –  Is there a richer theory for structured populations?
   –  How to parameterise structural complexity S(t) at time t, with associated theory for M(N,S)
•  Or can we fit the transition matrix M unconstrained?
   –  Then search for evidence of structure within it
   –  And/or do goodness of fit?

Adding another sequence

•  Chance of coalescence per generation from three sequences is 3/N
•  Once we have a coalescence we are back to the situation with two sequences
•  From i sequences chance is i(i-1)/2N

Digression: Coalescent model (Kingman, 1980)
A distribution on trees

•  T(i) ~ exponential with mean 2N/i(i-1)

Properties of the coalescent

•  As we add extra sequences, they are increasingly likely to coalesce very fast, and increasingly unlikely to affect the full TMRCA
•  Trees are very variable
   –  E.g. 4 samples on 6 leaves

The expected height of the tree for many samples is only twice that with two samples
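The factor of two follows from summing the mean waiting times T(i) = 2N/i(i-1) over i = n, n-1, ..., 2, which telescopes to 2N(1 - 1/n). A minimal numerical check, with illustrative function names and N = 1000 chosen arbitrarily:

```python
def expected_tmrca(n_samples, n_pop):
    """Expected tree height: sum of mean waiting times 2N / (i(i-1)) for i = 2..n."""
    return sum(2.0 * n_pop / (i * (i - 1)) for i in range(2, n_samples + 1))

N = 1000  # illustrative population size (gene copies)
for n in (2, 3, 10, 100, 1000):
    height = expected_tmrca(n, N)
    # The ratio to the two-sample height approaches 2 as n grows.
    print(n, height, height / expected_tmrca(2, N))
```

Half of the expected height is spent waiting for the final two lineages to coalesce, which is why extra samples add so little to the full TMRCA.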

Relationship between forwards in time (Wright-Fisher) and backwards in time (Coalescent) models

[Figure panels: Population evolution forwards; Coalescent tree backwards]

The coalescent tree describes a sample from the forward process.
Kingman coalescent generates an "exact" sample from Wright-Fisher

Genetic variation in a sample

•  Mutations occur at random on the tree
   –  Separation of sources of randomness
      •  Random demography tree structure from coalescent
      •  Random sampling of mutations on the tree

Watterson's theta

Let S be the number of mutations = segregating sites

E(S) ~ θ log n

•  Density of mutations with frequency i in a sample of n is θ/i
•  1/f distribution of population allele frequencies
•  Population minor allele frequency distribution of a difference observed between two sequences is flat
   –  Probability (1/f) · 2f(1-f) = 2(1-f), folded at ½ is 2
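The relations on this slide can be tied together numerically: summing the expected counts θ/i over i = 1..n-1 gives E(S) = θ times a harmonic number, which grows like log n. The sketch below uses illustrative function names and treats θ as a per-locus quantity.

```python
import math

def harmonic(n):
    """a_n = sum_{i=1}^{n-1} 1/i, the constant in Watterson's estimator."""
    return sum(1.0 / i for i in range(1, n))

def expected_sfs(theta, n):
    """Expected number of sites with derived allele count i, for i = 1..n-1."""
    return [theta / i for i in range(1, n)]

def expected_segregating_sites(theta, n):
    """E(S) = theta * a_n."""
    return theta * harmonic(n)

def watterson_theta(num_segregating, n):
    """Watterson's estimator: S / a_n."""
    return num_segregating / harmonic(n)

theta, n = 10.0, 50
print(sum(expected_sfs(theta, n)), expected_segregating_sites(theta, n))
print(harmonic(10000), math.log(10000))  # a_n tracks log n up to a constant offset
```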

Distribution of variant allele frequencies

site frequency spectrum (SFS)

Relaxation of assumptions (1)

•  E.g. change in population size changes the site frequency spectrum
•  Tajima's D
   –  Sensitive to number of rare mutations, so change in Ne
   –  If D is positive there is a deficiency of rare mutations
      •  Excess recent coalescences, recent small Ne - selection

[Figure panels: D>0; D<0]

This is the basis of SFS-based demography inference
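To make the sign of D concrete, here is a sketch that computes Tajima's D from an unfolded SFS using the standard normalising constants from Tajima (1989). The input convention (a list whose entry i-1 is the number of sites with derived allele count i) and the toy spectra are assumptions for illustration.

```python
import math

def tajimas_d(sfs):
    """Tajima's D from an unfolded SFS; sfs[i-1] = number of sites with derived count i."""
    n = len(sfs) + 1  # sample size
    if n < 4:
        raise ValueError("need at least 4 sequences")
    s = sum(sfs)  # number of segregating sites
    if s <= 0:
        raise ValueError("need at least one segregating site")
    pairs = n * (n - 1) / 2.0
    # Mean pairwise differences: a site with count i differs in i*(n-i) pairs.
    pi = sum(c * i * (n - i) for i, c in enumerate(sfs, start=1)) / pairs
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))

neutral = [100.0 / i for i in range(1, 10)]   # expected SFS shape theta/i
rare_excess = [50, 5, 3, 2, 1, 1, 1, 1, 1]    # many singletons
rare_deficit = [1, 1, 1, 5, 20, 5, 1, 1, 1]   # few rare variants
print(tajimas_d(neutral), tajimas_d(rare_excess), tajimas_d(rare_deficit))
```

The neutral-shaped spectrum gives D of zero up to rounding, an excess of singletons gives a negative D, and a deficiency of rare mutations gives a positive D, matching the bullets above.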

Fig. 2 The expected site frequency spectrum (SFS) of the derived allele (the new mutation arisen in the population) for three different demographic models: (i) a population that has been of constant size throughout history; (ii) a model previously fit to the derived allele frequency spectrum of Europeans, which includes an out-of-Africa population bottleneck and a second, more recent, population bottleneck (21); and (iii) the same two-bottleneck model of European history with the addition of recent exponential growth from a population size of 10,000 at the advent of agriculture to an extant effective population size of 10,000,000, which amounts to 1.7% growth per generation during the last 400 generations.

A Keinan, A G Clark, Science 2012;336:740-743

Individuals in human outbred populations still carry many variants not in the large sequence data sets (1000 Genomes etc.)

•  Exponential population growth in last 10,000 years gives long tips to the tree
•  In "big" populations, tips are hundreds of generations long, so tens of thousands of private variants per sample, hundreds functional

This behaviour is very dependent on population structure.

In genetic isolates the recent effective population size is smaller, and the tips are shorter

What about recombination?

•  If points on the genome are very close, e.g. adjacent, they share the same tree
•  If points are very far, their trees are sampled from the coalescent independently
•  What happens in between?
•  A recombination in the ancestor of a modern sequence made it out of two separate sequences, one contributing to the left and one to the right

Recombination changes the tree as you move along the sequence

Typically recombination rate is comparable to or larger than the mutation rate: both ~10^-8/bp/gen in human.
So "gene tree" varies every site in mixing populations

Ancestral Recombination Graph (ARG)

•  The Ancestral Recombination Graph describes the way that individual sequences in a population are related
   –  At a locus, sequences are related by a tree
   –  Ancestral recombinations change the tree as you move along the chromosome

Sequences and their binary encoding at the three variable sites:

a ..C..G..A..   0 0 0
b ..T..G..C..   1 0 1
c ..T..A..A..   1 1 0
d ..T..A..C..   1 1 1

[Figure: local trees on leaves a, b, c, d along the sequence, and the ARG that combines them]

ARG: "prune and graft" operation going left to right

Coalescent with recombination

•  ARG is a structure (data type)
•  The probability distribution over ARGs that arises when recombination is added to the standard (Wright-Fisher) model is called the Coalescent with Recombination
   –  Hudson's ms software is the classic simulator
   –  New msprime from Jerome Kelleher MUCH faster
•  Now two possible events going backwards in time
   –  Coalescence: which merges two sequences
      •  For i sequences, rate is i(i-1)/2N
   –  Recombination: which splits a sequence into two
      •  For i sequences, rate is iLρ
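The two competing rates can be compared directly. The sketch below simply evaluates the formulas from the bullets and the chance that the next event backwards in time is a recombination; it is not a coalescent simulator, and the parameter values in the example are illustrative assumptions.

```python
def event_rates(i, n_pop, seq_len, rho):
    """Per-generation rates for i lineages: coalescence i(i-1)/2N and recombination i*L*rho."""
    coalescence = i * (i - 1) / (2.0 * n_pop)
    recombination = i * seq_len * rho
    return coalescence, recombination

def prob_next_is_recombination(i, n_pop, seq_len, rho):
    """With two competing events, the next one is a recombination with probability rec/(coal+rec)."""
    coalescence, recombination = event_rates(i, n_pop, seq_len, rho)
    return recombination / (coalescence + recombination)

# Two lineages, N = 10,000, rho = 1e-8 per bp per generation, increasing segment length L.
for seq_len in (100, 10_000, 1_000_000):
    print(seq_len, prob_next_is_recombination(2, 10_000, seq_len, 1e-8))
```

Recombination dominates once L is long enough, which is why distant points on the genome have effectively independent trees.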

Extending to multiple sequences

•  The recent time limit of ~20 kya for PSMC is set because we run out of recent coalescences between two haplotypes
•  If we add more haplotypes, then there are more recent coalescences and we could look at more recent history
•  But, … the hidden state is then a tree (with branch lengths): impractical to model fully
   –  MCMC is notoriously difficult

Option 1: First coalescence of one sequence to the tree of the others

•  This is related to the Li and Stephens model (or Stephens and Donnelly) – chromopainter

Problem: Coalescence of chosen sequence to the others depends on the number of lineages M(t) remaining at time t

•  M(t) is a random variable, and we need the entire history of M(t) to calculate transition probabilities q
•  Huge increase in state space and/or this breaks Markov assumptions

MCMC approach: ARGweaver

•  Repeatedly remove a sequence* and add it back, sampling conditional on remaining ARG
•  HMM: sample with forward-backward algorithm
•  Costly – use for inference given history

Genome-wide inference of ancestral recombination graphs. Rasmussen MD, Hubisz MJ, Gronau I, Siepel A. PLoS Genet. 10:e1004342 (2014)

Option 2: first coalescence between any pair

•  This remains (approximately) Markov
•  State space is O(M²T) – pair of states and time they coalesce
   –  But transition updates are only O(M²T²), because transitions are memoryless
•  Emissions from Xij are singletons on i or j
   –  Non-singletons that are discrepant between i and j wipe out density at Xij

Stephan Schiffels and Durbin (Nature Genetics, 2015)

MSMC can fit both population size history and separation history

•  Separation via the (scaled) ratio of coalescence between and within populations

Access more recent history

Use lower mutation rate here: ~0.5 x 10^-9/year

Divergence between populations

[Figure labels: First Coalescence within Population 1; First Coalescence within Population 2; First Coalescence across both populations]

•  Idea: MSMC can infer separate coalescence rates within and between populations
•  Given rates within populations, λ11(t) and λ22(t), and across populations, λ12(t), compute relative gene flow as the ratio

   m(t) = λ12(t) / ( [λ11(t) + λ22(t)] / 2 )
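The ratio m(t) is straightforward to evaluate once the three rate curves are available per time bin. The sketch below assumes the rates are given as equal-length lists (one value per time interval); the function name and the toy numbers are illustrative and do not correspond to MSMC's output format.

```python
def relative_gene_flow(lam11, lam22, lam12):
    """m(t) = lambda12(t) / ((lambda11(t) + lambda22(t)) / 2), evaluated per time bin."""
    if not (len(lam11) == len(lam22) == len(lam12)):
        raise ValueError("rate lists must have the same length")
    return [l12 / ((l11 + l22) / 2.0) for l11, l22, l12 in zip(lam11, lam22, lam12)]

# Toy rates, ordered from recent to ancient time bins.
within_1 = [2.0, 2.0, 1.0, 1.0]
within_2 = [2.0, 2.0, 1.0, 1.0]
across = [0.0, 0.5, 0.8, 1.0]
print(relative_gene_flow(within_1, within_2, across))
```

In this toy example m(t) rises from 0 in the most recent bin to 1 in the oldest, the pattern expected for two populations that were once one.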

Testing gene flow inference with simulated split

☜ m(t) = 1: perfectly mixed

☜ m(t) = 0: perfectly split

4 haplotypes: good for splits 50-200 kya. 8 haplotypes: good for splits 5-50 kya.

Separation history

Alternatives to MSMC

•  MSMC2 (Schiffels: in Malaspinas 2016/unpub.)
   –  Run PSMC' on all pairs of sequences independently
   –  Multiply the likelihoods – composite likelihood
      •  Assumes the pairs are independent, which is false
      •  But gives unbiased estimation (though overconfident)
•  SMC++ (Terhorst, Kamm, Song: Nat Gen 2017)
   –  Pair, with p(het | other sequences)
   –  Very cool – works even on genotype data!
   –  But there are approximation problems analogous to those in MSMC – not a panacea

Using rare variants to infer demographic history

•  Rare variants contain information about recent population history and structure
•  Shown here: number of doubletons shared among European samples
•  We would like to estimate population split times and population sizes from the frequency of rare variants

[Figure: populations CEU, FIN, GBR, IBS, TSI; 1000 Genomes Project, Phase 3]

Compare to ChromoPainter data

Ancient samples from Hinxton

12880A, Iron age
12881A, Saxon
12883A, Saxon
12884A, Iron age
12885A, Saxon

More samples from Linton/Oakington

Sharing patterns between ancient and modern samples

•  Difference between Anglo-Saxon and Iron Age sharing with NED and IBS consistent across different allele counts
•  Small but significant differences also within modern Britain (UK10K): samples from Wales and Scotland share fewer rare variants with Dutch people

Estimates of Anglo-Saxon contribution to modern British genomes

•  Suggests ~30% Saxon contribution to samples in East of England, and ~20% to UK10K samples from Wales and Scotland
•  Consistent with 20-40% indirect estimate from POBI (Peoples of the British Isles) study

The rare allele coalescent

•  Goal: Estimate demographic history (population sizes and split times) from rare variants
•  Compute likelihood of demographic model given a distribution of rare variants

RareCoal model

•  Idea: Define recursion equations for probability of observing i derived alleles in population k
•  Given a demographic model, propagate this probability backwards in time to get full likelihood of the data.
•  Key simplification: Treat number of ancestral alleles over time as average (mean-field approximation)

Test inference with simulated data

[Figure: population trees, panels Simulated and Estimated. Axis: Time (generations). Branch labels are scaled population sizes, as extracted: 10 5 2 4 5 | 1 | 10.3 5 2 4 5]

Fits (100 samples per pop.):

Pattern      Count (real)   Count (predicted)
0,1,0,2,1    1114           1159
2,1,0,0,0    140585         139657
1,0,2,0,0    1138           1205
… thousands of rows …

Fitting population sizes and split times separates drift from divergence -> different from Treemix, qpGraph etc.

European Tree (Fits)

Placing ancient samples on the tree

•  Plots show the likelihood for merging the population N=1 sample onto the tree as a heat map

More direct calculation of the likelihood of the joint site frequency spectrum with momi

Jack Kamm … Song 2016, and unpublished

•  Complexity of ancestral allele state is reduced by using Moran model
•  Use Automatic Differentiation to calculate gradients to maximise likelihood over demography with (limited) gene flow

Momi applied to central Asian data

•  Include ancient samples
   –  Condition ascertainment on modern/deep samples
      •  Total branch length on these
   –  Random allele sampling for low coverage samples
•  Estimates split times
•  Bootstrap for confidence intervals
   –  But beware model misspecification

with Peter de Barros Damgaard, Rui Martiniano, Martin Sikora, Eske Willerslev et al.

Momi calculations

•  To calculate P(x1, x2, x3, …)
   –  Set leaves to Indicator(xi), e.g. [0,0,1,0…0] for xi = 2
   –  Propagate likelihoods up tree ("tree-peeling")
•  Can correspondingly calculate the expectation of any multi-linear function of allele counts
   –  𝔼[f1(x1) f2(x2) f3(x3) …]
•  by setting leaf i to [fi(0), fi(1), . . . , fi(ni)]
   –  Works because propagation is linear

Examples

•  Total branch length ∝ chance of any mutation
   –  fi(j) = 1, vector is [1,1,1…1]
•  TMRCA for pop i (i arbitrary unless ancient model)
   –  fi(j) = j/ni, vector is [0,0.2,0.4,0.6,0.8,1] for ni = 5
   –  fk(j) = 1, k ≠ i
•  f3 = 𝔼[(X1-X3)(X2-X3)], f4 = 𝔼[(X1-X2)(X3-X4)]
   –  Requires terms such as 𝔼[X1X2] for which
   –  f1(j) = j/n1, f2(j) = j/n2, fk(j) = 1, k > 2
•  Also numerators, denominators of FST, Tajima's D
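The leaf vectors in these examples are easy to write down explicitly. The sketch below builds them and shows, for a single leaf with a known distribution over allele counts, that the expectation of f is just a dot product, which is the linearity the method relies on. It does not reproduce momi's tree propagation or its API; all function names here are hypothetical.

```python
def indicator_leaf(n, x):
    """Leaf vector for P(x): 1 at derived-allele count x, 0 elsewhere (counts 0..n)."""
    return [1.0 if j == x else 0.0 for j in range(n + 1)]

def ones_leaf(n):
    """Leaf vector for f(j) = 1, as used for total branch length."""
    return [1.0] * (n + 1)

def frequency_leaf(n):
    """Leaf vector for f(j) = j/n, the sample allele frequency."""
    return [j / n for j in range(n + 1)]

def expectation(leaf, count_distribution):
    """E[f(X)] = sum_j f(j) * P(X = j); linear in the leaf vector."""
    return sum(f * p for f, p in zip(leaf, count_distribution))

n = 5
uniform = [1.0 / (n + 1)] * (n + 1)  # toy distribution over counts 0..5
print(frequency_leaf(n))
print(expectation(indicator_leaf(n, 2), uniform))  # recovers P(X = 2)
print(expectation(frequency_leaf(n), uniform))     # expected allele frequency
```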

Summary

•  PSMC(') estimates demography from a single pair of sequences
   –  Sample size is in length not number
   –  Quite a clean model
   –  Major issue is population structure
•  MSMC, MSMC2, SMC++ use additional samples to get at more recent times
•  RareCoal/Momi use coalescent modelling of the SFS on more samples to estimate trees
   –  With limited modelled gene flow for Momi

Experimental design

•  (Sequence) data collection costs money
•  We always need to make decisions in how to sample and sequence
   –  Number of samples
   –  Number of populations
   –  Depth of sequencing
   –  Whole Genome Shotgun or RADseq or Exomes…

1000 Genomes Project

•  Pilot (a very long time ago!)
   –  2 trios at high depth 30x
      •  Phasing, accurate single-sample genotype calling, mutation rates
   –  3 populations x 60 samples at low depth 2-4x + exomes
•  Main project
   –  26 populations of ~100 (2504 total) at 6-8x (+ exomes)
   –  (150 trios at high depth – but who remembers them?)

Malawi cichlid sequencing

•  Phase 1
   –  Three trios at 30x: mutation rate estimation, controls
   –  ~70 species at 15-20x, additional samples for some at 8-12x
•  Phase 2
   –  7 sets of 20 at 15x
   –  More species
   –  Some sets of 24 or 48 to address specific questions
•  Massoko GWAS (Turner)
   –  200 samples at 4x + 100 samples for replication

Low coverage sequencing strategy

•  Typically one needs to sequence at ~30x depth to find (almost) all variants in a sample
•  To find low frequency variants we want to sequence many samples
•  Spread sequence across more samples

Phase 1 power and genotyping accuracy

[Figure panels: SNP detection; Genotyping accuracy. Credit: Hyun Min Kang (U Michigan)]

Calling from low coverage sequence

•  Multi-sample call sites with samtools or GATK
•  Obtain genotype likelihoods at each site in each sample (also samtools or GATK)
   –  Likelihood = P(data | genotype)
•  Combine in an imputation framework using BEAGLE (Browning), or MINIMAC (Abecasis), or perhaps STITCH (Mott)?
•  Phase using SHAPEIT2 (Marchini) or EAGLE2 (Loh)
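To illustrate what "Likelihood = P(data | genotype)" means at low depth, here is a deliberately simplified sketch: reads are treated as independent draws, with a single assumed per-read error rate, at a biallelic site in a diploid sample. Real callers such as samtools and GATK use per-base qualities and more detailed error models; the function name and the 1% error rate are assumptions for illustration.

```python
from math import comb

def genotype_likelihoods(n_ref, n_alt, error=0.01):
    """P(read counts | genotype) for genotypes with 0, 1 or 2 alternate alleles.

    Simplified binomial model: each read shows the alternate allele with
    probability `error` (hom ref), 0.5 (het) or 1 - error (hom alt).
    """
    depth = n_ref + n_alt
    likelihoods = {}
    for n_alt_alleles, p_alt in ((0, error), (1, 0.5), (2, 1.0 - error)):
        likelihoods[n_alt_alleles] = (
            comb(depth, n_alt) * p_alt ** n_alt * (1.0 - p_alt) ** n_ref
        )
    return likelihoods

# At 2x depth, two reference reads cannot rule out a heterozygote.
print(genotype_likelihoods(n_ref=2, n_alt=0))
# With balanced reads the heterozygote is clearly favoured.
print(genotype_likelihoods(n_ref=2, n_alt=2))
```

With only two reads the heterozygous likelihood is still about a quarter of the homozygous one, which is why low-coverage calls are combined across samples in an imputation framework rather than called one sample at a time.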

Sequencing depth

•  30x is standard for near-complete accuracy
   –  Sufficient to estimate mutation rates in trios (need several trios for most species)
•  15x is good enough for SNPs (~97%), not quite so good for indels (perhaps 90-95%)
•  4-8x gives good low coverage imputation as in previous slides
•  People have used 1-2x, but this is hard work…
•  60x+ is necessary for subclonal structure, e.g. cancer, high ploidy
•  In a cross, sequence the founders to high depth, and the F2/F3 to low depth (1x or less is fine) and impute using STITCH or other Richard Mott tools
