inference of demographic histories of natural populations...

Post on 18-Jul-2020

3 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Inferenceofdemographichistoriesofnaturalpopulationsusingsequence

data

Coalescence,MutationandRecombination

RichardDurbinrd109@cam.ac.ukCeskyKrumlov24/1/18

WhatImeanbydemography

1.  Populationsizegoingbackintime– Actually“effectivepopulationsize”Ne(t)

•  Wewillcomebacktowhatthismeans– Approximatetimerange10k–1Myearsago

•  Againwewillseewhy2.  Populationstructure

–  Subpopulationsandwhentheysplit(andmerged?)

•  Basedonexplicitevolutionarymodels–  Relatepatternsof(shared)geneticvariation

accumulatedsinceacommonancestortohistory

Treeontwosequences•  GustaveMalécot(1940s)

•  Coalescenceisjoiningtogether,inourcasegoingbackwardsintime

•  Chanceofcoalescencepergenerationis1/N•  TMRCAisexponentiallydistributedwithmeanN

Probabilityofobservingamutation

•  Toseeamutation,itmusthavehappenedononeofthebranchessincethecommonancestor

•  P(observedmutation)=2Tµ•  E(observeddifferencerate)=θπ =2Nµ•  Humansarediploid,soθ =4Neµ, whereNeistheeffectivepopulationsize

•  Forhumans,θπ =~0.001– 1/800–1/1200dependingonpopulation

•  HardtomeasureNeandµ independently…

Effectivepopulationsize•  Lotsofmystique/angstaboutthis

–  Ourdefinitionisarguablyatthecoreoftheconcept•  thereciprocaloftheprobabilityofsharingaparentinthepreviousgeneration

•  =1/coalescencerate•  Whythisisdifferentfromcensuspopulationsize:

– Manyconsequencesoccuroverlargenumbers(oftenorderofNegenerations)–longtermaveraging

–  Structuregeneratesnon-randompatternsofcoalescence,andnon-independencebetweengenerations

– Maybeonlyasmallpercentageofindividualsbreed–  Selectionfavourssomeindividualsoverothers

•  Butisalwayssomethingofthisformthatwegetatbypopulationgeneticanalysis

SegmentsoffixedTMRCAareseparatedbyrecombination

Past

Mutations

Recombination in some ancestor

PairwiseSequentiallyMarkovianCoalescent

LiandDurbin(2010):Inferenceofhumanpopulationhistoryfromindividualgenomesequences

Hidden Markov Model

PSMCHiddenMarkovModel

•  Movefromlefttorightinthegenome– LetP(x|t)=prob(datauptox|TMRCAatx = t)– CalculateP(x+1|t)=(∑sP(x|s) r(t|s))e(x)

•  e(x)=“emissionatx”=2𝜇t ifahet,else(1-2𝜇t)•  r(t|s)=prob(recombinationfromTMRCAstot)=2𝜌s prob(coalescebacktot)

+ (1-2𝜌s) ift = s

DependsonN(t’) t’ < s,t

Markovassumption

•  Thismodelassumesthatdatatotheleftofx|TMRCAatx = t

isindependentofdatatotherightofx|TMRCAatx = t

•  Forstandardmixingpopulationsthisisaverygoodassumption– SequentiallyMarkovianCoalescentapproximation,McVean&Cardin2005

PSMC-HMMreconstructsindividualhistory

•  PairwiseSequentiallyMarkovianCoalescent–HiddenMarkovModel

•  Datasimulatedusingms(Hudson)•  Modelthecoalescenttimetbye.g.50discretebins,spreadlogarithmically

Singlehumangenomewithbootstrap

Humanpopulationhistory,withNeanderthals

HengLiScaled coalescent time

AdvancessincetheoriginalPSMC

1.  UseSMC’modelwhichcorrectlyhandlesrecombinationscoalescingbacktothesameancestor(Schiffels,…)

–  Minortweaktoequations,butsignificant–  Cannowfitrecombination:mutationratio–  ImplementedinMSMC/MSMC2

2.  Timespeedup:linearnotquadraticinnumberoftimeslices(Harris,...Song,2014)

CoalesecentNe(t)reflectsancestralstructureaswellaspopulationsize

•  PSMCactuallymeasuresλ = 1/coalescence rate

•  Structurecanalsochangecoalescencerate– Li&Durbinsupplement– OlivierMazet…Chikhi

N-island model Migration between islands controls coalescent rate

Humanpopulationhistory,withNeanderthals

rise of anatomicalmodern humans 1-200kya

origin of Homo1.5-2MyaHengLi

DramaticrecentradiationsofhaplochromimecichlidsintheAfricanriftvalleygreatlakes

•  LakeMalawi~500specieswithinlast1Myears

Sofarwehavesequenced~80speciesat15-20xcoverage

Briefintroductiontoanothersystem

LakeMalawicichlidPSMCAlticorpusAulonocaraBuccochromisCopadichromisLethrinopsMylochromisNimbochromisOtopharynxPlacidochromisTremitochranus

popu

latio

n siz

e

104

105

106

107

108

time [years]104 105 106

(Very) approximate 4x 4x 4x

Isstructureassociatedwithspeciation?

•  Perhaps–  Ideasofhybridspeciation,reuseofallelesselectedindifferentenvironments,hybridswarmsandgeneflow

•  Butthisisanothertalk…

Mightstructurebe(partly)identifiableinthePSMCmodel?

•  TheinferredvaluesN(t)havedimensionT,thenumberoftimebins

•  ButthetransitionmatrixMhasdimensionT2•  CurrentlywederiveMfromNbytheoryassumingpanmixia–  Istherearichertheoryforstructuredpopulations?–  HowtoparameterisestructuralcomplexityS(t)attimet,withassociatedtheoryforM(N,S)

•  OrcanwefitthetransitionmatrixMunconstrained?–  Thensearchforevidenceofstructurewithinit–  Andordogoodnessoffit?

Addinganothersequence

•  Chanceofcoalescencepergenerationfromthreesequencesis3/N

•  Oncewehaveacoalescencewearebacktothesituationwithtwosequences

•  Fromisequenceschanceisi(i-1)/2N

Digression:Coalescentmodel(Kingman,1980)Adistributionontrees

•  T(i)~exponentialwithmean2N/i(i-1)

Propertiesofthecoalescent

•  Asweaddextrasequences,theyareincreasinglylikelytocoalesceveryfast,andincreasinglyunlikelytoaffectthefullTMRCA

•  Treesareveryvariable– E.g.4samples

on6leaves

Theexpectedheightofthetreeformanysamplesisonlytwicethatwithtwosamples

Relationshipbetweenforwardsintime(Wright-Fisher)andbackwardsintime(Coalescent)models

Populationevolutionforwards

Coalescenttreebackwards

ThecoalescenttreedescribesasamplefromtheforwardprocessKingmancoalescentgeneratesan“exact”samplefromWright-Fisher

Geneticvariationinasample

•  Mutationsoccuratrandomonthetree– Separationofsourcesofrandomness

•  Randomdemographytreestructurefromcoalescent•  Randomsamplingofmutationsonthetree

Watterson’stheta

LetSbethenumberofmutations=segregatingsites

E(S)~θlogn

•  Densityofmutationswithfrequencyiinasampleofnisθ/i

•  1/fdistributionofpopulationallelefrequencies

•  Populationminorallelefrequencydistributionofadifferenceobservedbetweentwosequencesisflat–  Probability(1/f).2f(1-f)=2(1-f),foldedat½is2

Distributionofvariantallelefrequencies

sitefrequencyspectrumSFS

Relaxationofassumptions(1)

•  E.g.changeinpopulationsizechangesthesitefrequencyspectrum•  Tajima’sD

– Sensitivetonumberofraremutations,sochangeinNe

–  IfDispositivethereisadeficiencyofraremutations•  Excessrecentcoalescences,recentsmallNe-selection

D>0 D<0

ThisisthebasisofSFS-baseddemographyinference

Fig. 2 The expected site frequency spectrum (SFS) of the derived allele (the new mutation arisen in the population) for three different demographic models: (i) a population that has been of constant size throughout history; (ii) a model previously fit to the derived allele

frequency spectrum of Europeans, which includes an out-of-Africa population bottleneck and a second, more recent, population bottleneck (21); and (iii) the same two-bottleneck model of

European history with the addition of recent exponential growth from a population size of 10,000 at the advent of agriculture to an extant effective population size of 10,000,000, which

amounts to 1.7% growth per generation during the last 400 generations.

A Keinan, A G Clark Science 2012;336:740-743

Individualsinhumanoutbredpopulationsstillcarrymanyvariantsnotinthelargesequencedatasets(1000Genomesetc.)

•  Exponentialpopulationgrowthinlast10,000yearsgiveslongtipstothetree

•  In“big”populations,tipsarehundredsofgenerationslong,sotensofthousandsofprivatevariantspersample,hundredsfunctional

Thisbehaviourisverydependentonpopulationstructure.

Ingeneticisolatestherecenteffectivepopulationsizeissmaller,andthetipsareshorter

Whataboutrecombination?•  Ifpointsonthegenomeareveryclose,e.g.adjacent,theysharethesametree

•  Ifpointsareveryfar,theirtreesaresampledfromthecoalescentindependently

•  Whathappensinbetween?

•  Arecombinationintheancestorofamodernsequencemadeitoutoftwoseparatesequences,onecontributingtotheleftandonetotheright

Recombinationchangesthetreeasyoumovealongthesequence

Typicallyrecombinationrateiscomparabletoorlargerthanthemutationrate:both~10-8/bp/geninhumanSo“genetree”varieseverysiteinmixingpopulations

AncestralRecombinationGraph(ARG)

•  TheAncestralRecombinationGraphdescribesthewaythatindividualsequencesinapopulationarerelated–  Atalocus,sequencesarerelatedbyatree–  Ancestralrecombinationschangethetreeasyoumovealongthechromosome

a ..C..G..A..b ..T..G..C..c ..T..A..A..d ..T..A..C..

0 0 01 0 11 1 01 1 1

a b c d a b c da b c d

1

a b c d

2 3

R

ARG “Pruneandgraft”operationgoinglefttoright

Coalescentwithrecombination

•  ARGisastructure(datatype)•  TheprobabilitydistributionoverARGsthatariseswhenrecombinationisaddedtothestandard(Wright-Fisher)modeliscalledtheCoalescentwithRecombination– Hudson’smssoftwareistheclassicsimulator– NewmsprimefromJeromeKelleherMUCHfaster

•  Nowtwopossibleeventsgoingbackwardsintime–  Coalescence:whichmergestwosequences

•  Forisequences,rateisi(i-1)/2N–  Recombination:whichsplitsasequenceintotwo

•  Forisequences,rateisiLρ

Extendingtomultiplesequences

•  Therecenttimelimitof~20kyaforPSMCissetbecausewerunoutofrecentcoalescencesbetweentwohaplotypes

•  Ifweaddmorehaplotypes,thentherearemorerecentcoalescencesandwecouldlookatmorerecenthistory

•  But,…thehiddenstateisthenatree(withbranchlengths):impracticaltomodelfully– MCMCisnotoriouslydifficult

Option1:Firstcoalescenceofonesequencetothetreeoftheothers

•  ThisisrelatedtotheLiandStephensmodel(orStephensandDonnelly)–chromopainter

t1

t2

t3

t4

t1

t2

t3

t4

•  M(t)isarandomvariable,andweneedtheentirehistoryofM(t)tocalculatetransitionprobabilitiesq

•  Hugeincreaseinstatespaceand/orthisbreaksMarkovassumptions

t1

t2

t3

t4

Problem:Coalescenceofchosensequencetotheothersdependsonthenumberoflineages

M(t)remainingattimet

MCMCapproach:ARGweaver•  Repeatedlyremoveasequence*andadditback,samplingconditionalonremainingARG

•  HMM:samplewithforward-backwardalgorithm

Genome-wideinferenceofancestralrecombinationgraphsRasmussenMD,HubiszMJ,GronauI,SiepelA.PLoSGenet.10:e1004342(2014)

•  Costly–useforinferencegivenhistory

Option2:firstcoalescencebetweenanypair

•  Thisremains(approximately)Markov•  StatespaceisO(M2T)–pairofstatesandtimetheycoalesce– ButtransitionupdatesareonlyO(M2T2),becausetransitionsarememoryless

•  EmissionsfromXijaresingletonsoniorj– Non-singletonsthatarediscrepantbetweeniandjwipeoutdensityatXij

MSMC

StephanSchiffelsandDurbin(NatureGenetics,2015)

MSMCcanfitbothpopulationsizehistoryandseparationhistory

•  Separationviathe(scaled)ratioofcoalescencebetweenandwithinpopulations

Accessmorerecenthistory

Uselowermutationratehere~0.5x10-9/year

Divergencebetweenpopulations

FirstCoalescencewithinPopulation2

FirstCoalescencewithinPopulation1

FirstCoalescenceacrossbothpopulations

• MSMCcaninferseparatecoalescencerateswithinandbetweenpopulations

• Givenrateswithinpopulations,λ11(t)andλ22(t),andacrosspopulations,λ12(t),computerelativegeneflowasratio

λ12(t) [λ11(t)+ λ22(t)] / 2

m(t) =

•  Idea:Inferseparatecoalescencerateswithinandbetweenpopulations:

Testinggeneflowinferencewithsimulatedsplit

☜m(t)=1:perfectlymixed

☜m(t)=0:perfectlysplit

4haplotypes:goodforsplits50-200kya.8haplotypes:goodforsplits5-50kya.

Separationhistory

AlternativestoMSMC

•  MSMC2(Schiffels:inMalaspinas2016/unpub.)– RunPSMC’onallpairsofsequencesindependently– Multiplythelikelihoods–Compositelikelihood

•  Assumesthepairsareindependent,whichisfalse•  Butgivesunbiasedestimation(thoughoverconfident)

•  SMC++(Terhorst,Kamm,Song:NatGen2017)– Pair,withp(het|othersequences)– Verycool–worksevenongenotypedata!– ButthereareapproximationproblemsanalogoustothoseinMSMC–notapanacea

Usingrarevariantstoinferdemographichistory

•  Rarevariantscontaininformationaboutrecentpopulationhistoryandstructure

•  Shownhere:numberofdoubletonssharedamongEuropeansamples

•  Wewouldliketoestimatepopulationsplittimesandpopulationsizesfromthefrequencyofrarevariants

CEU FIN GBR IBS TSI

[1000GenomesProject,Phase3]

ComparetoChromoPainterdata

AncientsamplesfromHinxton

12884A,Ironage

12883A,Saxon

12880A,Ironage

12881A,Saxon12885A,Saxon

MoresamplesfromLinton/Oakington

Sharingpatternsbetweenancientandmodernsamples

•  DifferencebetweenAnglo-SaxonandIronAgesharingwithNEDandIBSconsistentacrossdifferentAlleleCounts

•  SmallbutsignificantdifferencesalsowithinmodernBritain(UK10K):SamplesfromWalesandScotlandsharefewerrarevariantswithDutchpeople

EstimatesofAnglo-SaxoncontributiontomodernBritishgenomes

•  Suggests~30%SaxoncontributiontosamplesinEastofEngland,and~20%toUK10KsamplesfromWalesandScotland

•  Consistentwith20-40%indirectestimatefromPOBI(PeoplesoftheBritishIsles)study

Therareallelecoalescent• Goal:Estimatedemographichistory(populationsizesandsplittimes)fromrarevariants

•  Computelikelihoodofdemographicmodelgivenadistributionofrarevariants

RareCoalmodel

•  Idea:Definerecursionequationsforprobabilityofobservingiderivedallelesinpopulationk:

•  Givenademographicmodel,propagatethisprobabilitybackwardsintimetogetfulllikelihoodofthedata.

•  Keysimplification:Treatnumberofancestralallelesovertimeasaverage(mean-fieldapproximation):

Testinferencewithsimulateddata

100

200

300

400

Time(generations)

10 5 2 4 5

0.4

0.5

0.2

1scaledpopulationsizes

92

200

299

388

10.3 5 2 4 5

0.47

0.5

0.22

0.87

Simulated Estimated

Fits(100samplesperpop.):Pattern Count(real) Count(predicted)

0,1,0,2,1 1114 1159

2,1,0,0,0 140585 139657

1,0,2,0,0 1138 1205

thousandsofrows…

Fittingpopulationsizesandsplittimesseparatesdriftfromdivergence->differentfromTreemix,qpGraphetc.

EuropeanTree(Fits)

Placingancientsamplesonthetree

•  PlotsshowthelikelihoodformergingthepopulationN=1sampleontothetreeasaheatmap

Moredirectcalculationofthelikelihoodofthejointsitefrequency

spectrumwithmomi

JackKamm…Song2016,andunpublished

•  ComplexityofancestralallelestateisreducedbyusingMoranmodel

•  UseAutomaticDifferentiationtocalculategradientstomaximiselikelihoodoverdemographywith(limited)geneflow

MomiappliedtocentralAsiandata

•  Includeancientsamples–  Conditionascertainmentonmodern/deepsamples

•  Totalbranchlengthonthese

–  Randomallelesamplingforlowcoveragesamples

•  Estimatessplittimes•  Bootstrapforconfidenceintervals–  Butbewaremodelmisspecification

withPeterdeBarrosDamgaard,RuiMartiniano,MartinSikora,EskeWillerslevetal.

Momicalculations

•  TocalculateP(x1,x2,x3,…)– SetleavestoIndicator(xi),e.g.[0,0,1,0…0] forxi=2– Propagatelikelihoodsuptree(“tree-peeling”)

•  Cancorrespondinglycalculatetheexpectationofanymulti-linearfunctionofallelecounts– 𝔼[f1(x1)f2(x2)f3(x3)…]

•  bysettingleafito[fi(0), fi(1), . . . , fi(n1)]– Worksbecausepropagationislinear

Examples

•  Totalbranchlength∝chanceofanymutation–  fi(j) = 1, vectoris[1,1,1…1]

•  TMRCAforpopi(i arbitraryunlessancientmodel)–  fi(j) = j/ni , vectoris[0,0.2,0.4,0.6,0.8,1] for ni=5–  fk(j) = 1, k ≠ i

•  f3 = 𝔼[(X1-X3)(X2-X3)], f4 = 𝔼[(X1-X2)(X3-X4)]–  Requirestermssuchas𝔼[X1X2]forwhich–  f1(j) = j/n1, f2(j) = j/n2, fk(j) = 1, k > 2

•  Alsonumerators,denominatorsofFST,Tajima’sD

Summary

•  PSMC(‘)estimatesdemographyfromasinglepairofsequences– Samplesizeisinlengthnotnumber– Quiteacleanmodel– Majorissueispopulationstructure

•  MSMC,MSMC2,SMC++useadditionalsamplestogetatmorerecenttimes

•  RareCoal/MomiusecoalescentmodellingoftheSFSonmoresamplestoestimatetrees– WithlimitedmodelledgeneflowforMomi

Experimentaldesign

•  (Sequence)datacollectioncostsmoney•  Wealwaysneedtomakedecisionsinhowtosampleandsequence– Numberofsamples– Numberofpopulations– Depthofsequencing– WholeGenomeShotgunorRADseqorExomes…

1000GenomesProject

•  Pilot(averylongtimeago!)– 2triosathighdepth30x

•  Phasing,accuratesingle-samplegenotypecalling,mutationrates

– 3populationsx60samplesatlowdepth2-4x+exomes

•  Mainproject– 26populationsof~100(2504total)at6-8x(+exomes)–  (150triosathighdepth–butwhoremembersthem?)

Malawicichlidsequencing

•  Phase1– Threetriosat30x:mutationrateestimation,controls– ~70speciesat15-20x,additionalsamplesforsomeat8-12x

•  Phase2– 7setsof20at15x– Morespecies– Somesetsof24or48toaddressspecificquestions

•  MassokoGWAS(Turner)– 200samplesat4x+100samplesforreplication

Lowcoveragesequencingstrategy

•  Typicallyoneneedstosequenceat~30xdepthtofind(almost)allvariantsinasample

•  Tofindlowfrequencyvariantswewanttosequencemanysamples

•  Spreadsequenceacrossmoresamples

Phase1powerandgenotypingaccuracy

SNPdetection Genotypingaccuracy

HyunMin-Kang(UMichigan)

Callingfromlowcoveragesequence

•  Multi-samplecallsiteswithsamtoolsorGATK•  Obtaingenotypelikelihoodsateachsiteineachsame(alsosamtoolsorGATK)– Likelihood=P(data|genotype)

•  CombineinanimputationframeworkusingBEAGLE(Browning),orMINIMAC(Abecasis),orperhapsSTITCH(Mott)?

•  PhaseusingSHAPEIT2(Marchini)orEAGLE2(Loh)

Sequencingdepth•  30xisstandardfornear-completeaccuracy

–  Sufficienttoestimatemutationratesintrios(needseveraltriosformostspecies)

•  15xisgoodenoughforSNPs(~97%),notquitesogoodforindels(perhaps90-95%)

•  4-8xgivesgoodlowcoverageimputationasinpreviousslides

•  Peoplehaveused1-2x,butthisishardwork…•  60x+isnecessaryforsubclonalstructure,e.g.cancer,highploidy

•  Inacross,sequencethefounderstohighdepth,andtheF2/F3tolowdepth(1xorlessisfine)andimputeusingSTITCHorotherRichardMotttools

top related