Ancestral Recombination Graphs

Ancestral relationships among a sample of recombining sequences usually cannot be accurately described by just a single genealogy.

Linked sites will have similar, but not necessarily identical, genealogies.

Recombination leads to variation in the TMRCA between different sites, which in turn can lead to variation in genetic polymorphism.

The ancestral recombination graph (ARG) is a generalization of the coalescent which describes the sequence of genealogies along a sample of recombining sequences.

Nordborg (2000)

Jay Taylor (ASU) Ancestral Recombination Graphs 16 Feb 2017 1 / 20


Variation in Total Tree Length in a Sample of 10 Chromosomes

Hudson (2000)


Applications of the ARG

The ARG has several uses:

Recombining sequences are potentially much more informative about demography, admixture and selection than a single completely-linked locus.

Fine-scale recombination rate estimation is possible if we can exploit high-density SNP data.

Statistical inference for GWAS can be improved if we can accurately account for the complex correlations that exist between multiple linked loci.


Meiotic recombination generates mosaic chromosomes

Strachan & Read (1996)


The Two-locus Ancestral Recombination Graph

For simplicity, consider a sample of n sequences containing just two loci. We will make the following assumptions.

The population evolves according to the diploid Wright-Fisher model, with population size N.

Each individual samples two chromosomes uniformly at random from the previous generation.

When a chromosome is sampled from a parent, it will either be inherited intact (with probability 1 − r) or else it will undergo a recombination between the two loci (with probability r).

When a recombination event occurs, the two loci will be inherited from the two different homologous chromosomes of that parent.

We will assume that no recombination occurs within the loci.


Suppose that N is large and that r is of order O(1/N). Then, when looking backwards in time, the ancestral relationships between the sampled sequences are determined by two processes:

With probability $\binom{n}{2}\frac{1}{2N}$, a randomly chosen pair of sequences coalesces.

With probability nr, a randomly chosen sequence is produced by a recombination event.

More complex scenarios involving multiple coalescences or recombinations have negligible probabilities under the above assumptions.

Hudson (2000)

When a sequence is produced by a recombination event, the two loci have different ancestors. In this case, the branch experiencing the recombination splits in two, with each emerging branch corresponding to one of the two ancestral sequences.


If N is large and time is measured in units of 2N generations, the ancestry of the sampled sequences can be modeled by a continuous-time Markov chain known as the two-locus ancestral recombination graph (Griffiths 1981):

At rate $\binom{n}{2}$, a randomly chosen pair of sequences coalesces and the number of branches in the ARG decreases by one.

At rate nρ/2, a randomly chosen sequence is produced by recombination. The corresponding branch splits into two branches, each containing material ancestral to one of the two loci. This increases the number of branches by one.

The process terminates when both loci have reached their MRCA. Because the branching rate is linear in the number of branches n while the coalescence rate is quadratic in n, the ARG is certain to reach an ultimate ancestor (UA) at some finite time.

If the two loci reach their MRCAs at different times, then it may be possible to terminate the process before reaching the UA.

The parameter $\rho = 4N_e r$ is known as the population recombination rate.
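
The continuous-time rates can be checked against the per-generation probabilities of the Wright-Fisher model by rescaling time. The short derivation below is a sketch that assumes the effective population size equals the Wright-Fisher size N, so that ρ = 4Nr, and that one unit of time is 2N generations:

```latex
\underbrace{\binom{n}{2}\frac{1}{2N}}_{\text{coalescence per generation}} \times 2N = \binom{n}{2},
\qquad
\underbrace{n r}_{\text{recombination per generation}} \times 2N = \frac{n\,(4Nr)}{2} = \frac{n\rho}{2}.
```

Both per-generation probabilities are of order 1/N, which is why the rescaled rates stay finite as N grows.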


The two-locus ARG is readily extended to multiple loci. Suppose that the sampled sequences contain L loci, and let $r_i$ be the recombination rate per generation between locus i and locus i + 1 and $r = r_1 + \cdots + r_{L-1}$ be the total recombination rate.

At rate $\binom{n}{2}$, a randomly chosen pair of sequences coalesces and the number of branches in the ARG decreases by one.

At rate nρ/2, a randomly chosen sequence is generated by recombination. In this case, the recombination breakpoint falls between locus i and locus i + 1 with probability

$P(B = i) = r_i / r$.

The affected branch splits into two branches, one containing material ancestral to loci 1, ..., i and the other containing material ancestral to loci i + 1, ..., L. This increases the number of branches by one.

The process terminates when all loci have reached their MRCA.
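
The event dynamics described above can be sketched as a simple stochastic simulation. The code below is an illustration only, not code from the lecture: it tracks just the number of branches and the sampled breakpoints, not which material is ancestral on each branch, so it always runs until a single branch (the ultimate ancestor) remains. The function name `simulate_arg_events` and its arguments are hypothetical.

```python
import random


def simulate_arg_events(n, rho, r, seed=None):
    """Simulate the sequence of ARG events, tracking only the branch count.

    n    : sample size (initial number of branches)
    rho  : population recombination rate for the whole region
    r    : per-generation recombination rates between adjacent loci
           (r[i-1] is the rate between locus i and locus i + 1)
    Returns (time_to_ultimate_ancestor, events), where each event is a tuple
    (time, kind, breakpoint) and time is in units of 2N generations.
    """
    rng = random.Random(seed)
    k, t, events = n, 0.0, []
    while k > 1:
        coal_rate = k * (k - 1) / 2      # binom(k, 2)
        rec_rate = k * rho / 2           # k * rho / 2
        total = coal_rate + rec_rate
        t += rng.expovariate(total)      # waiting time to the next event
        if rng.random() < coal_rate / total:
            k -= 1
            events.append((t, "coalescence", None))
        else:
            # Breakpoint B = i is chosen with probability r_i / r.
            b = rng.choices(range(1, len(r) + 1), weights=r)[0]
            k += 1
            events.append((t, "recombination", b))
    return t, events


if __name__ == "__main__":
    t_ua, events = simulate_arg_events(n=5, rho=1.0, r=[1e-8, 3e-8], seed=2)
    print(f"time to UA: {t_ua:.3f}, number of events: {len(events)}")
```

Because every coalescence removes one branch and every recombination adds one, the number of coalescences always exceeds the number of recombinations by n − 1 when the simulation stops.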


Example: An ARG with Mutation

Arenas et al. (2010)


Although the ARG provides an accurate description of the ancestral relationships among a sample of recombining sequences, using it to analyze sequence data is computationally challenging for three reasons.

Curse of dimensionality: The number of possible ARGs for a sample of n sequences containing L sites is $((2n-3)!!)^L$.

Weakly informative data: In general, the ARG is only weakly determined by the sequence data.

Long-range dependence: The genealogies at flanking sites remain correlated even if we condition on the genealogy at an intermediate site.

Because of these difficulties, a major research focus has been on the development of computationally efficient approximations for the ARG.
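
To get a feel for the size of this count, the sketch below evaluates it directly. It reads (2n − 3)!! as the number of labelled rooted binary tree topologies on n leaves, and the full count as the number of ways to assign one such topology to each of L sites. The function names are illustrative.

```python
def num_tree_topologies(n):
    """Return (2n - 3)!!, the number of labelled rooted binary trees on n >= 2 leaves."""
    count = 1
    for odd in range(3, 2 * n - 2, 2):   # 3 * 5 * ... * (2n - 3)
        count *= odd
    return count


def num_topology_sequences(n, num_sites):
    """Return ((2n - 3)!!) ** num_sites, one topology per site."""
    return num_tree_topologies(n) ** num_sites


if __name__ == "__main__":
    for n in (2, 3, 4, 10):
        print(n, num_tree_topologies(n))
    # Number of decimal digits in the count for 10 sequences and 100 sites.
    print(len(str(num_topology_sequences(10, 100))))
```

Even for 10 sequences there are already 34,459,425 topologies at a single site, and the count is raised to the power L across sites.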


Composite Likelihoods and the Two-locus ARG

Hudson (2000) proposed a way of approximating the likelihood of recombinant sequence data based on the two-locus ARG.

In this approach, a likelihood function $L_{ij}(\rho)$ is computed for each pair of segregating sites i and j using Monte Carlo-based methods.

The composite likelihood of the complete data is then calculated by multiplying all of the pairwise likelihoods:

$$L_{\mathrm{comp}}(\rho \mid D) = \prod_{i \neq j} L_{ij}(\rho)$$

The population recombination rate ρ can be estimated by finding the value $\hat{\rho}$ that maximizes $L_{\mathrm{comp}}$.

The composite MLE is known to be consistent, but the composite likelihood function itself is too peaked around $\hat{\rho}$, i.e., confidence intervals calculated using the curvature of $L_{\mathrm{comp}}$ will be too narrow.
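
The maximization step can be sketched as a grid search over ρ. The code below is a minimal illustration, not the method of any particular package: the pairwise log-likelihood `toy_pair_loglik` is a made-up stand-in for the Monte Carlo estimates of log L_ij(ρ), and the product is taken over unordered pairs i < j, which gives the same maximizer as the product over i ≠ j when L_ij = L_ji.

```python
def composite_log_likelihood(rho, pair_loglik, pairs):
    """Log of the composite likelihood: the sum of pairwise log-likelihoods."""
    return sum(pair_loglik(i, j, rho) for i, j in pairs)


def composite_mle(rho_grid, pair_loglik, pairs):
    """Return the grid value of rho that maximizes the composite likelihood."""
    return max(
        rho_grid,
        key=lambda rho: composite_log_likelihood(rho, pair_loglik, pairs),
    )


def toy_pair_loglik(i, j, rho):
    """Hypothetical stand-in for log L_ij(rho); NOT a real two-locus likelihood."""
    peak = 2.0 + 0.1 * (i + j)
    return -((rho - peak) ** 2)


if __name__ == "__main__":
    sites = [0, 1, 2]
    pairs = [(i, j) for i in sites for j in sites if i < j]
    rho_grid = [k / 10 for k in range(51)]          # 0.0, 0.1, ..., 5.0
    print(composite_mle(rho_grid, toy_pair_loglik, pairs))
```

Working with log-likelihoods avoids numerical underflow when many pairwise terms are multiplied together.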


A practical implementation of the composite likelihood approach was first made by McVean et al. (2002) and later updated by Auton & McVean (2007) to handle recombination rate variation (LDhat2).

Chr 19 (A) and 22 (B)

HLA Region

Source: McVean et al. (2004)


The Sequentially Markov Coalescent (SMC)

An alternative simplification of the ARG was proposed by McVean & Cardin (2005) which removes the long-range dependence of genealogies at different sites.

In this approach, which is known as the sequentially Markov coalescent (SMC), the ARG is approximated by a process that iteratively determines the genealogy at each position along a chromosome.

The SMC starts at one end of the chromosome and samples a coalescent tree $T_1$ using the ordinary coalescent.

It then generates a sequence of breakpoints $b_1, \dots, b_m$ and coalescent trees $T_1, \dots, T_m$ such that $T_i$ is the genealogy of the n sequences in the nonrecombinant segment $(b_{i-1}, b_i)$.

The procedure for generating the breakpoints and coalescent trees is such that the sequence $T_1, \dots, T_m$ is a Markov process, i.e., conditional on $T_i$, the trees $T_1, \dots, T_{i-1}$ are independent of the trees $T_{i+1}, \dots, T_m$.


The sequence of breakpoints and coalescent trees is generated using the following procedure:

Given $b_i$ and $T_i$, the distance to the next breakpoint $b_{i+1}$ is exponentially distributed with rate $\rho |T_i| / 2$, where $|T_i|$ is the total branch length in $T_i$.

Given $T_i$, the next tree $T_{i+1}$ is obtained by sampling a location uniformly at random along $T_i$ and detaching this lineage (and its subtree) from $T_i$. This generates a 'floating' lineage, which then coalesces with the remaining parts of $T_i$.

McVean & Cardin (2005)


The Pairwise Sequentially Markov Coalescent (PSMC)

To the extent that different loci have different genealogies, even a single diploid genome can be used to make inferences about demographic history and selection.

With just two copies of each locus, the genealogy at each site is fully specified by the pairwise coalescent time at that site.

In this case, the ARG along a chromosome can be represented by a sequence of breakpoints $b_1, \dots, b_m$ and pairwise coalescent times $t_1, \dots, t_m$.

Under the SMC, the sequence of pairwise coalescent times becomes a discrete-time Markov chain.

These sequences $(b_i, t_i)$ can be used to make inferences about demography and selection.

Li & Durbin (2011) introduced a method called the PSMC which uses a hidden Markov model to infer $(b_i, t_i)$ from the sequence data.
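
For two sequences, the SMC can be simulated in a few lines, which makes the Markov chain of coalescent times concrete. The sketch below assumes a constant population size, time in units of 2N generations, and ρ measured per unit of chromosome length. With two lineages the tree of height t has total length 2t, so the next breakpoint arrives at rate ρt; the recombination point is uniform on (0, t); and, since the detached branch can only rejoin the other lineage, the new coalescent time is the cut height plus an Exp(1) waiting time. The function name and arguments are illustrative.

```python
import random


def simulate_pairwise_smc(rho, length, seed=None):
    """Simulate (breakpoint, coalescent time) pairs along a chromosome for n = 2.

    rho    : scaled recombination rate per unit length
    length : chromosome length
    Returns a list of (segment_end, tmrca) tuples; the last segment ends at `length`.
    """
    rng = random.Random(seed)
    t = rng.expovariate(1.0)      # first tree: pairwise coalescent time ~ Exp(1)
    pos = 0.0
    segments = []
    while True:
        rate = rho * t            # rho * |T| / 2 with |T| = 2t
        gap = rng.expovariate(rate) if rate > 0 else float("inf")
        if pos + gap >= length:
            segments.append((length, t))
            return segments
        pos += gap
        segments.append((pos, t))
        cut = rng.uniform(0.0, t)             # recombination height on the tree
        t = cut + rng.expovariate(1.0)        # floating lineage re-coalesces at rate 1


if __name__ == "__main__":
    for end, tmrca in simulate_pairwise_smc(rho=0.5, length=20.0, seed=3):
        print(f"segment ends at {end:7.3f}, TMRCA = {tmrca:.3f}")
```

Segments with a deep coalescent time tend to be short, because the breakpoint rate grows with t.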


Hidden Markov Models (HMM)

Hidden Markov models can be used to analyze data generated by processes in which the state of the system is hidden from the observer.

The state of the system evolves according to a discrete-time Markov chain: $X_1, X_2, X_3, \dots$.

Whereas $X_t$ is hidden, at each time t the observer can measure some variable $Y_t$ that depends only on $X_t$.

The objective is to use the observations $(Y_t)$ to learn about $(X_t)$, which can be done using dynamic programming algorithms.

Y0        Y1        Y2        Y3        Y4        Y5
↑ e       ↑ e       ↑ e       ↑ e       ↑ e       ↑ e
X0 --p--> X1 --p--> X2 --p--> X3 --p--> X4 --p--> X5
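
One of the dynamic programming algorithms referred to above is the forward algorithm, which computes the likelihood of the observations by summing over all hidden paths. The sketch below is a generic, minimal version for a finite state space; the argument names (`init`, `trans`, `emit`) are assumptions of this example, with `trans` playing the role of p and `emit` the role of e in the diagram.

```python
import math


def forward_log_likelihood(init, trans, emit, obs):
    """Log-likelihood of an observation sequence under a finite-state HMM.

    init[i]     : P(X_0 = i)
    trans[i][j] : P(X_{t+1} = j | X_t = i)
    emit[i][y]  : P(Y_t = y | X_t = i)
    obs         : list of observed symbols (indices into emit[i])
    """
    n_states = len(init)
    # alpha[i] is proportional to P(Y_0..Y_t, X_t = i); rescaled to avoid underflow.
    alpha = [init[i] * emit[i][obs[0]] for i in range(n_states)]
    log_lik = 0.0
    for y in obs[1:]:
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][y]
            for j in range(n_states)
        ]
    return log_lik + math.log(sum(alpha))


if __name__ == "__main__":
    init = [0.5, 0.5]
    trans = [[0.9, 0.1], [0.1, 0.9]]
    emit = [[0.8, 0.2], [0.3, 0.7]]
    print(forward_log_likelihood(init, trans, emit, [0, 1, 1, 0]))
```

The cost is proportional to the sequence length times the square of the number of hidden states, rather than exponential in the sequence length.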


Under the PSMC, the ancestral recombination graph and the sequence data can be represented by an HMM.

Both processes $(X_i)$ and $(Y_i)$ are indexed by position along a chromosome.

The 'hidden' variable $X_i$ is the TMRCA at that position.

The 'observed' variable $Y_i$ is the pair of nucleotides in the sampled genome at that position.

For practical reasons, the state space of the hidden variables is usually required to be finite. To this end, Li & Durbin (2011) replace the continuous interval $[0, \infty)$ by a discrete set $\{s_0, s_1, \dots, s_m\}$.
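
The two ingredients of such an HMM, a discrete time grid and an emission probability, can be sketched as follows. Both functions are illustrative assumptions rather than a reproduction of the published method: the grid is a log-spaced choice that places more points near zero, and the emission model assumes infinite-sites mutation with time in units of 2N generations and a hypothetical per-site scaled mutation rate `theta`, so that a site is homozygous with probability exp(−θt).

```python
import math


def log_spaced_time_grid(m, t_max):
    """Return s_0 = 0 < s_1 < ... < s_m = t_max, denser near zero (illustrative)."""
    return [
        0.1 * (math.exp(i / m * math.log(1.0 + 10.0 * t_max)) - 1.0)
        for i in range(m + 1)
    ]


def emission_probs(t, theta):
    """Return (P(homozygous), P(heterozygous)) at a site whose TMRCA is t.

    Assumes no mutation on either lineage (total length 2t, rate theta / 2 each)
    is required for the two copies to be identical.
    """
    p_hom = math.exp(-theta * t)
    return p_hom, 1.0 - p_hom


if __name__ == "__main__":
    grid = log_spaced_time_grid(m=8, t_max=15.0)
    for s in grid:
        p_hom, p_het = emission_probs(s, theta=0.001)
        print(f"s = {s:7.3f}   P(het) = {p_het:.5f}")
```

A finer grid near zero gives better resolution for recent coalescent times, where the emission probabilities change most quickly relative to t.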


Applications of the PSMC from Li & Durbin (2011)


Extensions to multiple sequences

In principle, these ideas can be extended to samples containing more than two sequences.

Hobolth et al. (2007) developed a coalescent HMM to handle data sampled from two or three species.

Because the number of genealogies grows super-exponentially with the number of sampled sequences, this approach quickly becomes intractable.

Rasmussen et al. (2014) proposed a novel approach based on the SMC-HMM which uses a clever MCMC algorithm ('threading') to generate a sample of ARGs from the posterior distribution given the sequence data and model parameters.

Threading works by stochastically building up the ARG one sequence at a time.

This approach is implemented in the software package ARGweaver.


Application: Distinguishing Background Selection from Selective Sweeps

Rasmussen et al. (2014)
