
International Statistical Review (1998), 66, 1, 29-40, Printed in Mexico © International Statistical Institute

Inferring Gene Ancestry: Estimating Gene Descent

Elizabeth Thompson

Department of Statistics, University of Washington, Box 354322, Seattle, WA 98195-4322, USA

Summary

Segregation indicators provide a fundamental description of the outcomes of Mendelian segregation. This description facilitates algorithms for computation of probabilities of gene identity patterns among individuals in a specified genealogical structure. Such gene identity patterns determine in turn the pattern of observable trait similarities among relatives. Conversely, patterns of gene identity are the parameters of genealogical relationship that can be estimated from trait data on individuals. Where data on large numbers of individuals or at large numbers of genetic loci are to be analyzed, Markov chain Monte Carlo realizations of underlying Mendelian segregation indicators can be used to provide Monte Carlo estimates of likelihoods for gene ancestry or of genetic model parameters. Genealogical information in genetic data is bounded by the information in patterns of genomic identity by descent. A different form of Monte Carlo likelihood can be used to assess this information.

Key words: Gene identity; Genealogical inference; Genome sharing patterns; Mendelian segregation; Monte Carlo likelihood; Multiple kinship.

1 Introduction

A pedigree is a specification of the genealogical relationships among a set of individuals. A convenient form of this specification is to identify the father and the mother of each individual. Individuals at the top of the pedigree, whose parents are unspecified, are the founders of the pedigree; other individuals are non-founders. Relationships among individuals are defined relative to the specified pedigree; thus, by definition, the founders are unrelated.

Mendel’s First Law (1866) states that each individual has two “factors” (or genes) controlling a given characteristic, one being a copy of a corresponding gene in the father of the individual, the other a copy of a gene in the mother of the individual. Further, a copy of a randomly chosen one of the two is copied to each child, independently for different children, independently of genes contributed by the spouse. For a single pair of genes, the segregation (copying) of genes is fully specified by segregation indicators

$$S_i = \begin{cases} 0 & \text{if copied gene is parent's maternal gene} \\ 1 & \text{if copied gene is parent's paternal gene} \end{cases} \qquad (1)$$

where $i = 1, \ldots, m$ indexes the segregations (parent-child links) in the pedigree. Mendel’s First Law then simply states that the indicators $S_i$ are independent, and $\Pr(S_i = 0) = \Pr(S_i = 1) = \frac{1}{2}$.

In this paper, we show how methods for the analysis of gene descent and inference of gene ancestry may be developed in terms of the segregation indicators $S = \{S_i;\ i = 1, \ldots, m\}$. Historically, this has not been the predominant approach used in analyses of human genetic data. Paradoxically, it is not the approach used in the field of “segregation analysis”, in which the genetic basis of traits is inferred from data on related individuals. However, the passage of genes down pedigrees as described by Mendel’s First Law provides the fundamental description of the genetic consequences of genealogical relationship. Thus development in this framework provides both new insights into some standard methods and well-known results, and also a basis for the development of new methods.

Relatives have similar observable characteristics (phenotypes) because they carry genes that are identical copies of a single gene in some common ancestor, which has descended to each relative via the relevant segregations. Because these genes are copies of a single gene, they will have similar effects on the phenotypes of individuals. Genes that are copies of a single gene within a pedigree are said to be identical by descent (IBD). In this paper, gene identity by descent is considered only within the context of a defined pedigree; then, by definition, the genes of founders of the pedigree are not IBD.

Phenotypic similarities among relatives result from the genes they share IBD. Among an ordered set of genes, a partition of the set may be used to specify which subsets of the genes are IBD. Among a set of observed individuals, we denote this partition of their genes by $J$, and refer to it as the pattern of gene identity by descent among the individuals. The segregation indicators $S = \{S_i;\ i = 1, \ldots, m\}$ of equation (1) determine the pattern, $J$, of genes IBD in any currently observed set of individuals; $J = J(S)$. The probability of any data (i.e. observed phenotypes of the individuals) depends on $S$ only through $J(S)$, and we may write

$$\Pr(\text{data}) = \sum_{S} \Pr(\text{data} \mid J(S)) \Pr(S). \qquad (2)$$

In partitioning the likelihood in this way, the “genetic model” is separated from the effects of genealogical structure. The probability of a given pattern $J(S)$ depends on the genealogical relationship among the observed individuals. Given the gene identity pattern, $J(S)$, the probability of data depends on the different types of genes, their frequencies, and how they affect observable phenotypes. Thus, the passage of genes in pedigrees provides the connection between observable genetic characteristics and the pedigree structure, whether we are estimating relationships from genetic data, estimating the genetic basis of traits knowing the pedigree, or inferring the ancestry and descent of particular genes, knowing both the genetic model and the data.
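As a concrete illustration of how the indicators determine the IBD pattern, the following minimal sketch traces labelled founder genes down a toy pedigree and reads off the partition J(S). The pedigree, the function names and the "mat"/"pat" tags are hypothetical choices made for this example only; they are not taken from the paper.

```python
import random
from itertools import count

# Hypothetical toy pedigree: name -> (mother, father); founders map to None.
# Parents are listed before their children.
PEDIGREE = {
    "GM": None, "GF": None,
    "A": ("GM", "GF"),
    "B": ("GM", "GF"),
}


def segregations(pedigree):
    """List the parent-child links (segregations) as (parent, child) pairs."""
    return [(p, child) for child, parents in pedigree.items() if parents for p in parents]


def simulate_S(pedigree, rng):
    """Draw independent indicators with Pr(S_i = 0) = Pr(S_i = 1) = 1/2."""
    return {link: rng.randint(0, 1) for link in segregations(pedigree)}


def drop_genes(pedigree, S):
    """Trace founder gene labels down the pedigree.

    S maps (parent, child) -> 0 (parent's maternal gene copied) or 1 (paternal).
    Returns name -> (maternal gene label, paternal gene label).
    """
    labels = count()
    genes = {}
    for name, parents in pedigree.items():
        if parents is None:
            genes[name] = (next(labels), next(labels))  # two distinct founder genes
        else:
            mother, father = parents
            genes[name] = (genes[mother][S[(mother, name)]],
                           genes[father][S[(father, name)]])
    return genes


def ibd_pattern(genes, observed):
    """Partition J(S) of the observed individuals' genes into IBD classes."""
    classes = {}
    for name in observed:
        for which, label in zip(("mat", "pat"), genes[name]):
            classes.setdefault(label, []).append((name, which))
    return sorted(classes.values())
```

Calling `ibd_pattern(drop_genes(PEDIGREE, simulate_S(PEDIGREE, random.Random())), ["A", "B"])` gives one realization of J(S) for the two sibs; repeating the draw many times would give Monte Carlo frequencies of the patterns.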

For simplicity, this introduction has considered only a single pair of genes in each individual, the genes at a single (autosomal) genetic locus. We continue to adopt this simplification in sections 2 to 4. In fact, genes come on chromosomes, and segregations of genes at positions (loci) on the same chromosome are not independent. In the case of data on traits affected by genes at multiple genetic loci, the probability of the total relevant set of segregation indicators depends also on the chromosomal relationships among the genetic loci. In sections 5 and 6, we will consider linked loci, and in section 7 develop a model for a continuous genome.

2 Recursive Descent Probabilities

One result that follows immediately from the segregation indicators (1) is the set of well-known recursive equations for the coefficient of kinship, $\psi(B, C)$, between a pair of individuals, B and C. Introduced by Wright (1922), this overall measure of relationship between two individuals may be defined as the probability that genes segregating from each of B and C are IBD, being copies of a single gene in some common ancestor of B and C. (An individual is considered his own ancestor.)

Provided B is not an ancestor of C, we may condition on the segregation $S$ from B, where $\Pr(S = 0) = \Pr(S = 1) = \frac{1}{2}$. If $S = 0$, the segregating gene is B’s maternal gene; that is, a gene from the mother of B. If $S = 1$, the gene is B’s paternal gene. Thus we obtain immediately

$$\psi(B, C) = \psi(M_B, C)\Pr(S = 0) + \psi(F_B, C)\Pr(S = 1) = \tfrac{1}{2}\left(\psi(M_B, C) + \psi(F_B, C)\right) \qquad (3)$$

where $M_B$ and $F_B$ are the mother and the father of B. Also, from the definition, we have symmetry: $\psi(B, C) = \psi(C, B)$. Thus the only additional equation needed is for the case B = C. In this case, we must consider two independent segregations from B, $S_1$ and $S_2$: $\Pr(S_1 = S_2) = \Pr(S_1 \neq S_2) = \frac{1}{2}$. If $S_1 = S_2$, the same gene segregates each time, and the two segregating genes are necessarily IBD. If $S_1 \neq S_2$, the two genes comprise both the maternal and the paternal gene of B. Combining these possibilities, we have

$$\psi(B, B) = \Pr(S_1 = S_2) + \psi(M_B, F_B)\Pr(S_1 \neq S_2) = \tfrac{1}{2}\left(1 + \psi(M_B, F_B)\right).$$

Together with the boundary conditions

$$\psi(B, B) = \tfrac{1}{2} \text{ for any founder } B, \quad \text{and} \quad \psi(B, C) = 0 \text{ if } B \text{ is a founder and not an ancestor of } C,$$

these equations determine the function $\psi(\cdot)$ on the pedigree.

An important extension to these equations was made by Karigl (1981), who considered the probability of simultaneous identity by descent, $\psi(B_1, \ldots, B_k)$, of genes segregating from a set of (not necessarily distinct) individuals $B_1, B_2, \ldots, B_k$. Again if $B_1$ is not an ancestor of any of $B_2, \ldots, B_k$, conditioning on the segregation from $B_1$ gives, analogously to equation (3),

$$\psi(B_1, B_2, \ldots, B_k) = \tfrac{1}{2}\left(\psi(M_{B_1}, B_2, \ldots, B_k) + \psi(F_{B_1}, B_2, \ldots, B_k)\right). \qquad (4)$$

The symmetry of the definition provides that we may collect the arguments for some $B_1$ who is not an ancestor of any of the others to the first $t$ arguments of $\psi$. Then, considering the $t$ independent segregations from $B_1$, either the segregating gene is the same in every case, being a random gene from $B_1$, or both the maternal and the paternal gene of $B_1$ are among the $t$ genes. Since

$$\Pr(S_1 = S_2 = \cdots = S_t) = 2^{-t+1},$$

we obtain

$$\psi(\underbrace{B_1, \ldots, B_1}_{t}, B_2, \ldots, B_k) = 2^{-t+1}\left(\psi(B_1, B_2, \ldots, B_k) + (2^{t-1} - 1)\,\psi(M_{B_1}, F_{B_1}, B_2, \ldots, B_k)\right). \qquad (5)$$

Together with symmetry and boundary conditions, these equations determine the multiple kinship coefficients on any pedigree, although practical implementation can be problematic on a large multi-generation pedigree if $k \geq 7$.
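The recursions (3) to (5), with the founder boundary conditions, can be sketched in a few lines. This is a minimal illustration on a hypothetical seven-member pedigree, not Karigl's implementation: the pedigree, the names and the rule of always expanding the most recently listed individual (which cannot be an ancestor of the others when parents precede children) are choices made for the example. Exact fractions are used so that results can be checked by hand.

```python
from fractions import Fraction
from functools import lru_cache

# Hypothetical pedigree: name -> (mother, father); founders map to None.
# Parents are listed before their children, so list position gives a valid ordering.
PEDIGREE = {
    "GM": None, "GF": None, "X": None,
    "A": ("GM", "GF"),   # A and B are full sibs
    "B": ("GM", "GF"),
    "H": ("GM", "X"),    # H is a maternal half-sib of A and B
    "I": ("A", "B"),     # I is the child of the sib mating A x B
}
ORDER = {name: i for i, name in enumerate(PEDIGREE)}


def kinship(*individuals):
    """psi(B1, ..., Bk) for genes segregating from the listed individuals (k >= 1)."""
    return _psi(tuple(sorted(individuals, key=ORDER.__getitem__)))


@lru_cache(maxsize=None)
def _psi(inds):
    b = inds[-1]              # latest individual: not an ancestor of any other
    t = inds.count(b)         # number of segregations taken from b
    rest = inds[:-t]
    parents = PEDIGREE[b]
    if parents is None:       # boundary conditions for founders
        return Fraction(0) if rest else Fraction(1, 2 ** (t - 1))
    mother, father = parents
    if t == 1:                # equations (3) and (4)
        return (kinship(mother, *rest) + kinship(father, *rest)) / 2
    # equation (5): all t segregations copy one gene, or both parental genes occur
    same = kinship(b, *rest)
    both = kinship(mother, father, *rest)
    return (same + (2 ** (t - 1) - 1) * both) / 2 ** (t - 1)
```

For instance, `kinship("A", "B")` gives 1/4 for the full sibs and `kinship("I", "I")` gives 5/8 for their inbred child. The memoisation keeps the toy example fast, but the number of distinct argument tuples still grows quickly with k, in line with the practical limit noted above.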

Karigl (1981) was interested primarily in the case $k = 4$, and in the determination of the probabilities of patterns of IBD, $J$, among the four genes of two individuals, at a single genetic locus. We return to probabilities of gene identity patterns in section 4, but consider first an alternative application of these multiple kinship equations.

3 Inference of Ancestry

Suppose that rather than considering the simultaneous IBD of genes segregating from each of a set of individuals, $B_1, \ldots, B_k$, we consider instead the probability, $\psi_A(B_1, \ldots, B_k)$, that all these genes descend, by segregation over the generations, from a specified set of founder genes $A$. Such a probability will be of interest if, for example, the individuals $B_1, \ldots, B_k$ are the set of parents of individuals affected by a rare recessive disease. An individual is affected only if he/she receives a defective copy of the gene from each parent. Thus the probability $\psi_A(B_1, \ldots, B_k)$ is the probability that the affected individuals are so, given that the set $A$ includes all the founder copies of the defective allele. (We ignore here the possibility of new disease mutations within the pedigree.) That is, $\psi_A(B_1, \ldots, B_k)$ is the likelihood that the set of founder genes $A$ includes the total set of original founder copies of the defective allele.


A gene segregating from $B_1$ is either the maternal ($S = 0$) or the paternal ($S = 1$) gene. Thus, conditioning on the segregation from an individual $B_1$ who is not an ancestor of any of $B_2, \ldots, B_k$, we obtain, analogously to equations (3) and (4),

$$\psi_A(B_1, B_2, \ldots, B_k) = \tfrac{1}{2}\left(\psi_A(M_{B_1}, B_2, \ldots, B_k) + \psi_A(F_{B_1}, B_2, \ldots, B_k)\right),$$

the criterion of IBD of the genes being replaced by that of descent from $A$. The other recursive and symmetry equations defining $\psi_A(\cdot)$ are also exactly as before; the only difference is in the boundary conditions, since now, for founders, the question is whether the segregating gene is a gene in $A$ rather than whether the genes are IBD. In principle, the recursive equations developed by Karigl (1981) thus provide likelihoods for alternative hypotheses concerning the founder origins of disease alleles.
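A minimal sketch of this ancestral likelihood follows, under one explicit assumption that the text does not spell out: the founder boundary condition is taken to be that each of t independent segregations from a founder yields a gene in A with probability (number of that founder's genes in A)/2, independently of everything else. The pedigree and all names are hypothetical.

```python
from fractions import Fraction
from functools import lru_cache

# Hypothetical pedigree: name -> (mother, father); founders map to None.
PEDIGREE = {"GM": None, "GF": None, "A": ("GM", "GF"), "B": ("GM", "GF")}
ORDER = {name: i for i, name in enumerate(PEDIGREE)}


def _ordered(inds):
    return tuple(sorted(inds, key=ORDER.__getitem__))


def descent_probability(individuals, founder_genes):
    """psi_A: probability that every segregating gene is a copy of a gene in A.

    founder_genes is a collection of (founder, "mat" or "pat") pairs.
    """
    return _psi_A(_ordered(individuals), frozenset(founder_genes))


@lru_cache(maxsize=None)
def _psi_A(inds, A):
    if not inds:
        return Fraction(1)
    b = inds[-1]              # latest individual: not an ancestor of any other
    t = inds.count(b)
    rest = inds[:-t]
    parents = PEDIGREE[b]
    if parents is None:
        # Assumed boundary condition: each segregation from founder b
        # independently copies a gene in A with probability (genes of b in A) / 2.
        in_A = sum(1 for founder, _ in A if founder == b)
        return Fraction(in_A, 2) ** t * _psi_A(rest, A)
    mother, father = parents
    if t == 1:                # analogue of equations (3) and (4)
        return (_psi_A(_ordered(rest + (mother,)), A)
                + _psi_A(_ordered(rest + (father,)), A)) / 2
    same = _psi_A(rest + (b,), A)                       # analogue of equation (5)
    both = _psi_A(_ordered(rest + (mother, father)), A)
    return (same + (2 ** (t - 1) - 1) * both) / 2 ** (t - 1)
```

With A containing only the maternal gene of founder GM, `descent_probability(["A", "B"], {("GM", "mat")})` is 1/16: this would be the likelihood of that single-gene hypothesis if A and B were the parents of an affected child.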

In practice, there are several difficulties with this approach to ancestral inference. First, these are rarely the only data. There is normally information that other current individuals are not affected. For a lethal disease, it is known that ancestors of current individuals cannot have been affected. Although the approach can be modified to take account of such additional data (Thompson & Morgan, 1989), the simplicity is lost. Second, although a likelihood for any set $A$ is computable, there is a very large number of discrete alternative hypotheses. Further, there are statistical issues in the comparison of likelihoods, for sets $A$ of different size, possibly overlapping, or even including each other. These issues become more complex if the sets $A$ are not restricted to founder genes, but may include the genes of other ancestors; the genes in $A$ may then themselves be IBD. Third, there are computational issues; while the set $A$ is not a limiting factor, the number $k$ of current genes must be restricted. As with multiple kinship coefficients, on a multigeneration pedigree the method is restricted to $k \leq 6$ for most practical purposes.

More recently an alternative approach to ancestral inference has been adopted. Ideally, we would like to compute $\Pr(G_F \mid Y)$, where $G_F$ is a specification of the types of the genes carried by all the founders of the pedigree, and $Y$ is the total set of phenotypic data on the pedigree. On a large complex pedigree, exact computation of this probability is infeasible, but it is, in principle, possible to sample from it using Markov chain Monte Carlo (MCMC). The joint probability of data $Y$ and the types of underlying genes of all members of the pedigree $G$ is very easily computed:

$$\Pr(Y \mid G) = \prod_{\text{observed } i} \Pr(Y^{(i)} \mid G_i)$$

$$\Pr(G) = \prod_{\text{founders } i} \Pr(G_i) \prod_{\text{non-founders } i} \Pr(G_i \mid G_{M_i}, G_{F_i})$$

where $Y^{(i)}$ are the data on individual $i$, and $G_i$ the types of the underlying genes carried by this individual. Then the desired conditional probability is proportional to this joint probability

$$\Pr(G \mid Y) = \Pr(Y, G) / \Pr(Y) \propto \Pr(Y \mid G) \Pr(G)$$

and thus may be sampled from using MCMC.
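The factorization above is simple to evaluate for any complete assignment G, which is what an MCMC sampler needs. The sketch below does so for a hypothetical trio under stated assumptions that are not from the paper: a locus with two alleles (0 normal, 1 disease) with disease allele frequency q, founder genes drawn independently, ordered genotypes (maternal allele, paternal allele), and a fully penetrant recessive phenotype.

```python
from fractions import Fraction

# Hypothetical pedigree: name -> (mother, father); founders map to None.
PEDIGREE = {"M": None, "F": None, "C": ("M", "F")}


def founder_prob(genotype, q):
    """Pr(G_i) for a founder: two independent genes, disease allele frequency q."""
    freq = {0: 1 - q, 1: q}
    return freq[genotype[0]] * freq[genotype[1]]


def transmission_prob(allele, parent_genotype):
    """Mendelian segregation: each parental gene is copied with probability 1/2."""
    return Fraction(sum(allele == g for g in parent_genotype), 2)


def penetrance(phenotype, genotype):
    """Pr(Y_i | G_i) for an assumed fully penetrant recessive disease."""
    affected = genotype == (1, 1)
    return Fraction(int((phenotype == "affected") == affected))


def joint_probability(G, Y, q):
    """Pr(Y, G) = Pr(Y | G) Pr(G) for ordered genotypes G and phenotypes Y."""
    prob = Fraction(1)
    for name, parents in PEDIGREE.items():
        if parents is None:
            prob *= founder_prob(G[name], q)
        else:
            mother, father = parents
            prob *= transmission_prob(G[name][0], G[mother])  # maternal gene
            prob *= transmission_prob(G[name][1], G[father])  # paternal gene
    for name, phenotype in Y.items():                          # observed individuals only
        prob *= penetrance(phenotype, G[name])
    return prob
```

A Metropolis-type sampler would only ever need ratios of `joint_probability` for a current and a proposed G, so the normalising constant Pr(Y) never has to be computed.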

The conditional independence structure of genotypes in a pedigree is also straightforward; individuals receive copies of their parents’ genes, and, together with their spouses, segregate genes to their offspring. However, simple MCMC methods such as the Gibbs sampler are ineffective; the data may be few, but are highly constraining on $G$, and Mendelian segregation also places many constraints on the feasible values of $G$. The simulated tempering MCMC method has been used to resolve these problems and to sample successfully from $\Pr(G \mid Y)$ on a 12-generation, 5,000-member, highly complex pedigree (Geyer & Thompson, 1995). However, it seems that this pedigree stretched the limits of even that approach, and for problems this large, or larger, alternative MCMC methods are needed (Geyer, personal communication).


4 GIBD at Independent Loci: Inferring Relationship

We return now to the patterns of gene IBD implied by a genealogical relationship, and to the converse problem of estimating genealogical relationships from observed data, or (assuming “perfect” data) from realized patterns of gene IBD. To introduce the ideas, consider first the simplest case. This is the case of two non-inbred individuals; that is, the two parents of each individual are not related to each other. There are then just 3 possibilities at each locus; the individuals share 2 genes, 1, or 0, with probabilities, determined by their relationship, $k_2$, $k_1$, $k_0$ with $k_2 + k_1 + k_0 = 1$. The probability of phenotypic data $Y$ on the two individuals, for a trait determined by alleles at a single genetic locus, is

$$\Pr(Y) = \sum_{i=0}^{2} k_i \Pr(Y \mid J_i)$$

where $J_i$ is the state in which individuals share $i$ genes, $i = 0, 1, 2$. For data $Y = (Y_1, \ldots, Y_L)$ at $L$ independently segregating loci

$$\Pr(Y) = \prod_{l=1}^{L} \sum_{i=0}^{2} k_i \Pr(Y_l \mid J_i).$$

Different genetic loci will provide different probabilities $\Pr(Y_l \mid J_i)$; these probabilities depend on the genetic model at the locus: the alleles, their population frequencies, and their phenotypic effects. More generally, for data on any individuals

$$\Pr(Y) = \prod_{l=1}^{L} \sum_{S} \Pr(Y_l \mid J(S)) \Pr(S) \qquad (6)$$

where $J(S)$ is the single-locus pattern of gene identity, determined by the pattern of segregation indicators $S$ (see equation (2)).

Thus $k = (k_0, k_1, k_2)$ or more generally the single-locus probabilities $\Pr(J(S))$ determined by a genealogical relationship are the parameters of relationship; they define the phenotypic consequences of any given genealogical relationship. Conversely, these parameters may be estimated from observed data $Y$ at independently segregating loci, via the likelihood (6).
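To make the pairwise likelihood concrete, here is a minimal sketch for two non-inbred individuals typed at codominant marker loci, so that the data at each locus are simply the two unordered genotypes. The marker model, the function names and the brute-force enumeration over the distinct underlying genes are assumptions made for illustration; they are not the paper's algorithm.

```python
from itertools import product
from math import prod


def pr_genotypes_given_state(g1, g2, freqs, shared):
    """Pr(Y_l | J_i): probability of unordered genotypes g1, g2 when the two
    individuals share `shared` (0, 1 or 2) genes IBD, with independent allele
    draws for the distinct genes (allele frequencies `freqs`)."""
    g1, g2 = tuple(sorted(g1)), tuple(sorted(g2))
    total = 0.0
    # 4 - shared distinct genes underlie the two genotypes
    for genes in product(range(len(freqs)), repeat=4 - shared):
        if shared == 0:
            a, b, c, d = genes
        elif shared == 1:
            a, b, d = genes
            c = a                      # one gene of each individual is the same gene
        else:
            a, b = genes
            c, d = a, b                # both genes are shared
        if tuple(sorted((a, b))) == g1 and tuple(sorted((c, d))) == g2:
            total += prod(freqs[x] for x in genes)
    return total


def likelihood(k, data, freqs_by_locus):
    """Product over independently segregating loci of sum_i k_i Pr(Y_l | J_i)."""
    return prod(
        sum(k[i] * pr_genotypes_given_state(g1, g2, freqs, i) for i in range(3))
        for (g1, g2), freqs in zip(data, freqs_by_locus)
    )
```

Maximising `likelihood` over valid k for fixed data would give the estimate of relationship discussed in the text; no optimiser is included in this sketch.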

However, even in the simplest case of estimation of $k = (k_0, k_1, k_2)$, there are complications in this estimation. One is that many different relationships can give the same value of $k$ and hence are indistinguishable on the basis of data at independently segregating genetic loci. Another is that not all values of $k = (k_0, k_1, k_2)$ with $k_0 + k_1 + k_2 = 1$ correspond to genealogical relationships. Not only must each $k_i$ be a dyadic rational, but $k_1^2 \geq 4 k_0 k_2$ (Thompson, 1976), and many standard relationships lie on one of the three boundaries of the space ($k_2 = 0$, $k_0 = 0$ or $k_1^2 = 4 k_0 k_2$).

Moreover, even in this simplest case, relationships can be complex. In particular it is possible for each parent of each of the two individuals to be related both to the mother and the father of the other individual, without the two parents of either individual being related to each other. The simplest example is that of quadruple-half-first-cousins (Figure 1), which has $k = (\frac{17}{32}, \frac{14}{32}, \frac{1}{32})$. This point lies midway between that for half-sibs ($k = (\frac{1}{2}, \frac{1}{2}, 0)$) on the boundary $k_2 = 0$ and that for double-first-cousins ($k = (\frac{9}{16}, \frac{6}{16}, \frac{1}{16})$) on the boundary $k_1^2 = 4 k_0 k_2$. All three relationships have the same kinship coefficient; $\psi = \frac{1}{4}(2 k_2 + k_1) = 0.125$.
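The arithmetic in this paragraph can be checked exactly. The sketch below stores the three k vectors as exact fractions and verifies the common kinship coefficient, the constraint $k_1^2 \geq 4 k_0 k_2$, and the midpoint property; the dictionary and function names are hypothetical.

```python
from fractions import Fraction as F

# k = (k0, k1, k2) for the three relationships discussed above
RELATIONSHIPS = {
    "half-sibs": (F(1, 2), F(1, 2), F(0)),
    "quadruple half first cousins": (F(17, 32), F(14, 32), F(1, 32)),
    "double first cousins": (F(9, 16), F(6, 16), F(1, 16)),
}


def kinship_from_k(k):
    """psi = (2 k2 + k1) / 4 for a pair of non-inbred individuals."""
    k0, k1, k2 = k
    return (2 * k2 + k1) / 4


def satisfies_constraint(k):
    """Necessary conditions: probabilities summing to 1 and k1^2 >= 4 k0 k2."""
    k0, k1, k2 = k
    return min(k) >= 0 and k0 + k1 + k2 == 1 and k1 ** 2 >= 4 * k0 * k2


def on_quadratic_boundary(k):
    k0, k1, k2 = k
    return k1 ** 2 == 4 * k0 * k2


def midpoint(k_a, k_b):
    return tuple((a + b) / 2 for a, b in zip(k_a, k_b))
```

Here `kinship_from_k` returns 1/8 for each entry, only double first cousins lie on the quadratic boundary, and the midpoint of the half-sib and double-first-cousin vectors is the quadruple-half-first-cousin vector.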

This raises the major statistical difficulty in the estimation of relationship; $(k_0, k_1, k_2)$ are trinomial cell probabilities. In practice, the state $J$ at a given locus is seldom known with certainty. Even if we assume such perfect data, data at many independently segregating loci would be required to obtain a precise estimate of $k$, or to distinguish values as close as those of double first cousins and quadruple half first cousins.
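A rough sense of "many" can be obtained from the expected log-likelihood ratio per locus under perfect data, which is the Kullback-Leibler divergence between the two trinomial distributions. The sketch below is only an illustration: the target expected log-likelihood ratio of log 100 is an arbitrary choice made here, not a criterion from the paper, and the k vectors are those of quadruple half first cousins and double first cousins.

```python
import math


def kl_divergence(k_true, k_alt):
    """Expected log-likelihood ratio per locus when the IBD state is observed."""
    return sum(p * math.log(p / q) for p, q in zip(k_true, k_alt) if p > 0)


def loci_needed(k_true, k_alt, target_log_lr=math.log(100)):
    """Number of independent loci for the expected log-likelihood ratio to reach the target."""
    return math.ceil(target_log_lr / kl_divergence(k_true, k_alt))


QUADRUPLE_HALF_FIRST_COUSINS = (17 / 32, 14 / 32, 1 / 32)
DOUBLE_FIRST_COUSINS = (9 / 16, 6 / 16, 1 / 16)
```

Under these assumptions `loci_needed(QUADRUPLE_HALF_FIRST_COUSINS, DOUBLE_FIRST_COUSINS)` comes out at a few hundred loci, and that is with the IBD state known exactly at every locus.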


Figure 1. A pedigree showing the relationship of quadruple-half-first-cousins. This is the relationship between the two third-generation individuals in the figure. Neither individual is inbred, although the parent of each is related to both the mother and the father of the other.

This problem is compounded if we consider joint data on more than two individuals. The number of possible states $J(S)$ increases rapidly with the number of individuals. For example, between the 12 genes of six individuals at a single locus there are 198,091 states with distinct genotypic consequences (Thompson, 1974). Of course, for simple relationships not all states will have positive probability. Nonetheless, in principle, estimation of the cell probabilities of a 198,091-cell multinomial is required. When this framework was developed, the limits were computational; 198,091 was a large number. Now, for six individuals, computations are quite feasible, using the same algorithms (Thompson, 1974) but today’s computers. However, we can never have data at the number of independently segregating genetic loci required to estimate accurately even a small fraction of the IBD state probabilities.
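The growth in the number of states can be illustrated by brute force for small numbers of individuals. The sketch below assumes that "states with distinct genotypic consequences" means partitions of the 2n genes counted up to exchanging the two genes within each individual; it enumerates all partitions and keeps one canonical representative per class. This is an illustration only, not the algorithm of Thompson (1974), and in pure Python it is practical only for a few individuals (six individuals would mean 4,213,597 partitions, each examined under 64 swap patterns).

```python
from itertools import product


def set_partitions(n):
    """Yield the set partitions of range(n) as restricted growth strings."""
    def extend(prefix, n_blocks):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        for block in range(n_blocks + 1):
            prefix.append(block)
            yield from extend(prefix, max(n_blocks, block + 1))
            prefix.pop()
    yield from extend([], 0)


def relabel(blocks):
    """Renumber block labels in order of first appearance."""
    seen = {}
    return tuple(seen.setdefault(b, len(seen)) for b in blocks)


def count_states(n_individuals):
    """Count IBD states of 2n genes up to swapping the two genes within individuals.

    Genes 2i and 2i+1 are the two genes of individual i.
    """
    canonical = set()
    for part in set_partitions(2 * n_individuals):
        images = []
        for swaps in product((False, True), repeat=n_individuals):
            swapped = list(part)
            for i, do_swap in enumerate(swaps):
                if do_swap:
                    swapped[2 * i], swapped[2 * i + 1] = swapped[2 * i + 1], swapped[2 * i]
            images.append(relabel(swapped))
        canonical.add(min(images))       # one representative per equivalence class
    return len(canonical)
```

For two individuals this gives the familiar 9 states (from 15 partitions of four genes), and for three individuals 66.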

5 Gene Identity at Linked Loci

Since there are insufficient independently segregating loci for useful inference, we consider now linked loci. Segregation indicators may be defined as before (equation (1)):

$$S_{i,l} = \begin{cases} 0 & \text{if copied gene at segregation } i \text{, locus } l \text{ is parent's maternal gene} \\ 1 & \text{if copied gene at segregation } i \text{, locus } l \text{ is parent's paternal gene.} \end{cases} \qquad (7)$$

Here $i = 1, \ldots, m$ indexes the segregations of the pedigree, and $l = 1, \ldots, L$ indexes the genetic loci. The marginal distribution of each $S_{i,l}$ is as before: $\Pr(S_{i,l} = 0) = \Pr(S_{i,l} = 1) = \frac{1}{2}$. For different segregations $i$, the $S_{i,l}$ are independent, and the dependence between loci is measured by the recombination frequency. For two given loci ($l = 1, 2$) the recombination frequency $\theta$ between them is

$$\theta = \Pr(S_{i,1} \neq S_{i,2}) \text{ for each } i, \qquad 0 \leq \theta \leq \tfrac{1}{2}.$$

For loci that are close together on a chromosome, $\theta$ is close to 0. For independently segregating loci, $\theta = \frac{1}{2}$. The joint distributions of $S_{i,l}$ over more than two loci will be considered in the next section. For notational convenience, we assume that $\theta$ does not vary with $i$. In practice, recombination frequencies vary among segregations, a major factor in this variation being the sex of the parent. Computationally, such variation can easily be accommodated.

The recursive equations for multiple kinship coefficients (equations (4) and (5)) extend to multiple loci, conditioning on the segregation indicators in a given segregation, over the loci in question. Consider, for example, the case of $\psi_2(L_1(B^{(1)}, C), L_2(B^{(1)}, D))$. This expression denotes the two-locus kinship probability that, in a single gamete segregating from B, the gene at locus $L_1$ is IBD to that on a gamete segregating from individual C, while the gene at locus $L_2$ is IBD to that on a gamete segregating from individual D. The identical superscript “(1)” on the individual B indicates that we are considering here a single segregation $i$ from B, rather than two separate segregations to different offspring. Now if B is not an ancestor of C or D, we may condition on the four events $(S_{i,1}, S_{i,2}) = (0,0), (0,1), (1,1), (1,0)$ with probabilities $\frac{1}{2}(1 - \theta)$, $\frac{1}{2}\theta$, $\frac{1}{2}(1 - \theta)$, $\frac{1}{2}\theta$ respectively, where $\theta$ is the recombination frequency between locus 1 and locus 2. Thus we obtain

$$\begin{aligned}
\psi_2(L_1(B^{(1)}, C), L_2(B^{(1)}, D)) ={}& \tfrac{1}{2}(1 - \theta)\,\psi_2(L_1(M_B^{(1)}, C), L_2(M_B^{(1)}, D)) \\
&+ \tfrac{1}{2}\theta\,\psi_2(L_1(M_B, C), L_2(F_B, D)) \\
&+ \tfrac{1}{2}(1 - \theta)\,\psi_2(L_1(F_B^{(1)}, C), L_2(F_B^{(1)}, D)) \\
&+ \tfrac{1}{2}\theta\,\psi_2(L_1(F_B, C), L_2(M_B, D)).
\end{aligned}$$

The full set of equations for determining two-locus gene identity probabilities between genes segregating from up to four individuals are given by Thompson (1988). These equations can be used to determine two-locus IBD state probabilities, even on a large and complex pedigree.

Several interesting features of these state probabilities emerge. One is that relationships that were not previously identifiable now become so. Consider for example the case of half-sisters, a grandmother-granddaughter, and an aunt-niece relationship. Each of these relationships has $k = (\frac{1}{2}, \frac{1}{2}, 0)$, and hence they are indistinguishable on the basis of data at independently segregating loci. However, the probabilities that the two relatives share genes IBD at both of two linked loci, with recombination frequency $\theta$, are $\frac{1}{2}(1 - 2\theta + 2\theta^2)$, $\frac{1}{2}(1 - \theta)$ and $\frac{1}{4}(2 - 5\theta + 8\theta^2 - 4\theta^3)$ respectively. In principle, the relationships are distinguishable on the basis of data at linked loci. However, there are relationships that are non-identifiable on the basis of data at pairs of loci (whatever the values of the recombination frequencies between them), but that are identifiable on the basis of data at trios of loci (Thompson, 1988). One may conjecture that there are relationships non-identifiable on the basis of data at $L$ loci, but identifiable on the basis of data at $L + 1$ loci.
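The three two-locus sharing probabilities can be checked by direct enumeration of the segregation indicators at the two loci, which is feasible because each relationship involves at most five relevant segregations. The encoding of each relationship as a predicate on the single-locus indicators is an assumption of this sketch (the common ancestors are taken to be the mother, the grandmother, and both grandparents respectively), and the function names are hypothetical.

```python
from itertools import product


def seg_prob(s1, s2, theta):
    """Pr(S_i1 = s1, S_i2 = s2) for one segregation at two linked loci."""
    return 0.5 * (theta if s1 != s2 else 1.0 - theta)


def prob_ibd_both(n_seg, shares, theta):
    """Sum Pr(S) over indicator configurations in which the relatives share a
    gene IBD at both loci. `shares` maps the n_seg indicators at one locus to a bool."""
    total = 0.0
    for config in product(product((0, 1), repeat=2), repeat=n_seg):
        locus1 = tuple(c[0] for c in config)
        locus2 = tuple(c[1] for c in config)
        if shares(locus1) and shares(locus2):
            p = 1.0
            for s1, s2 in config:        # segregations are independent
                p *= seg_prob(s1, s2, theta)
            total += p
    return total


def half_sibs(s):
    """Indicators: mother -> sib 1, mother -> sib 2."""
    return s[0] == s[1]


def grandparent(s):
    """Indicator: parent -> grandchild; 0 copies the parent's gene from the grandmother."""
    return s[0] == 0


def aunt_niece(s):
    """Indicators: GM -> aunt, GM -> parent, GF -> aunt, GF -> parent, parent -> niece."""
    gm_aunt, gm_parent, gf_aunt, gf_parent, parent_niece = s
    return gm_aunt == gm_parent if parent_niece == 0 else gf_aunt == gf_parent
```

For example, `prob_ibd_both(5, aunt_niece, 0.1)` can be compared with the cubic quoted above; at $\theta = \frac{1}{2}$ all three relationships give $\frac{1}{4}$, as expected for independently segregating loci.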


6 Segregation Indicators at Linked Loci

The binary segregation indicators (equation (7)) provide a means to trace the descent and ancestry of genes at multiple linked loci. They determine patterns of gene identity by descent, which in turn determine patterns of phenotypic similarity among relatives. However, first we require a model for the joint probability distribution of these indicators. The simplest model, and adequate for our purpose, is that due to Haldane (1919). The points along the chromosome at which $S_{i,l}$ switches from 0 to 1, or from 1 to 0, are known as crossovers. Haldane’s model is that crossovers occur as a Poisson process; he defined the unit of genetic distance (the Morgan) such that the rate of this Poisson process is 1 per Morgan.

Under this model, the dependence structure of the $S_{i,l}$ takes a simple form, with a spatial Markov property over loci $l$, and with segregations $i$ being independent. The probability of any given indicator $S_{i,l}$ conditional on all the others, $S_{-(i,l)}$, depends only on the indicators for the same segregation and the two neighboring loci:

$$\Pr(S_{i,l} \mid S_{-(i,l)}) = \Pr(S_{i,l} \mid S_{i,l-1}, S_{i,l+1})$$

where for notational convenience we assume the loci to be numbered sequentially along the chromosome. This structure again suggests MCMC. Recall again equation (2):

$$\Pr(Y) = \sum_{S} \Pr(Y \mid J(S)) \Pr(S).$$

A Monte Carlo estimate of the likelihood ratio between two alternative relationships may be obtained by sampling from

$$\Pr(S \mid Y) = \Pr(Y, S) / \Pr(Y) \propto \Pr(Y \mid S) \Pr(S)$$

since

$$\frac{\Pr_1(Y)}{\Pr_0(Y)} = E_0\!\left[ \frac{\Pr_1(Y, S)}{\Pr_0(Y, S)} \,\Big|\, Y \right]$$

where the subscripts on probabilities and expectation designate two alternative relationship hypotheses (Thompson & Guo, 1991). Note that we consider MCMC sampling of $S$, not of IBD patterns $J(S)$; the segregation process $S$ is Markov along the chromosome, but the agglomerated process $J(S)$ is not.
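The local dependence on the two neighboring loci is what makes single-site updates cheap. The sketch below computes the prior part of the full conditional of one indicator given its neighbors, using the standard Haldane map function $\theta = \frac{1}{2}(1 - e^{-2d})$ to convert a map distance $d$ in Morgans to a recombination frequency. The function names are hypothetical, and in an actual sampler each weight would also be multiplied by the data term $\Pr(Y_l \mid S)$, which is omitted here.

```python
import math


def haldane_theta(d_morgans):
    """Recombination frequency at map distance d (Morgans) under Haldane's model."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))


def transition(s_from, s_to, theta):
    """Pr(indicator at the next locus = s_to | indicator at this locus = s_from)."""
    return theta if s_from != s_to else 1.0 - theta


def full_conditional(left, right, theta_left, theta_right):
    """Prior Pr(S_{i,l} = 1 | S_{i,l-1} = left, S_{i,l+1} = right).

    Pass None for a missing neighbour at either end of the chromosome.
    """
    weights = []
    for s in (0, 1):
        w = 0.5 if left is None else transition(left, s, theta_left)
        if right is not None:
            w *= transition(s, right, theta_right)
        weights.append(w)
    return weights[1] / (weights[0] + weights[1])
```

With both neighbors equal to 0 and $\theta = 0.1$ on each side, the indicator is 1 only with probability 0.01/0.82, reflecting the double crossover that would be required.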

Suppose, as for equation (6), that the data $Y$ partition into disjoint parts $Y_l$ determined by the types of the genes only at locus $l$. Conditional simulation by MCMC requires efficient computation of the probability ratio

$$\frac{\Pr(Y \mid S^*)}{\Pr(Y \mid S)} = \prod_{l} \frac{\Pr(Y_l \mid S^*_{i,l},\ i = 1, \ldots, m)}{\Pr(Y_l \mid S_{i,l},\ i = 1, \ldots, m)}$$

for a current $S$ and a proposed $S^*$. However, this is less computationally demanding than at first appears: if the proposed and current segregation patterns differ only at a single locus $l$, the problem reduces to a single-locus computation. As discussed in section 4, it is then feasible for data $Y_l$ on only a few ($\leq 6$, say) individuals.

7 Genome Identity by Descent

As we obtain data at more and more loci, each additional locus adds less and less information about patterns of genome sharing among the individuals sampled. There is a limit to the amount of information in the entire genome. Moreover, when there are more loci than there are crossover events in the pedigree it is more efficient to keep track of the crossovers than of the loci. In modern studies, where 200 sampled loci per Morgan is not impractical, that point is already reached. Consider therefore the segregation process $S_i(z) = 0$ or 1 as the parent’s maternal or paternal chromosome is being copied, in segregation $i$ at position $z$. Under Haldane’s model, the processes $S_i(z)$ are independent binary Markov processes, each switching at a rate of 1 per Morgan.

Again we shall consider “perfect” data, in which the IBD status can be observed without error or uncertainty, and for simplicity we shall consider inferences from two gametes, one from each of two relatives whose genealogical relationship is to be estimated. Thus the data consist of a realization $I(z) = 0$ or 1 as the two gametes are not or are IBD at position $z$ in the genome. The analogue of equation (2) is now

$$\Pr(Y = I(z)) = \int \Pr(Y = I(z) \mid S(z)) \, d\Pr(S(z)). \qquad (8)$$

Exact computation is infeasible, except for very simple relationships. Monte Carlo provides an alternative, but there appears to be no straightforward MCMC method. Note that the switch-points of the data-IBD process $I(z)$ are constrained to be a subset of the crossover points on the pedigree.

A solution has been proposed by Guy (1997; unpublished), who restores a discrete sum in (8) by considering the discrete jump chain $C$ of the process $S(z)$:

$$\Pr(Y = I(z)) = \sum_{c} \Pr(Y = I(z) \mid C = c) \Pr(C = c). \qquad (9)$$

The jump chain is a Markov chain on $2^t$ states, where $t$ is the number of segregations in the pedigree defining the relationship. A subset of the states give $I = 1$, and the remainder give $I = 0$. The likelihood (9) is estimated by Monte Carlo, using realizations $c$ resulting from processes $S_i(z)$, $i = 1, \ldots, t$. The Monte Carlo estimate from $N$ realizations $c^{(n)}$, $n = 1, \ldots, N$ of the jump chain $C$ is

$$\widehat{\Pr}(Y = I(z)) = \frac{1}{N} \sum_{n=1}^{N} \Pr(Y = I(z) \mid C = c^{(n)}).$$

The continuous IBD/non-IBD segment lengths of the data $I(z)$ are matched to the discrete step-number IBD/non-IBD segments of the realization $c^{(n)}$, so that each term in the sum is a product of probabilities of the observed length of the $I(z)$ segment, conditional on the number of steps in the matched segment of $c^{(n)}$. Edge effects must be accounted for, both in conditioning on the IBD/non-IBD state $I(0)$ at the start of a chromosome, and in adjusting the contribution from the final segment of each chromosome (Guy, 1997). Additionally, $N$ separate realizations of $C$ with $N$ sufficiently large for accurate Monte Carlo estimation would be impractical. Great efficiencies can be achieved by a method of recycling realizations, using overlapping portions of a single long realization of the jump chain $C$; details are given by Guy (1997).
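The jump chain itself is easy to simulate. The sketch below generates the process S(z) for a great-grandparent and great-grandchild pair as a continuous-time chain on $2^5$ states and returns the fraction of a chromosome on which the two gametes are IBD. It illustrates only the simulation of S(z) and I(z), not Guy's likelihood estimator or the recycling scheme; the coding of the five segregations (three links that must copy the gene on the line of descent, coded 1 here, plus the two segregations from the great-grandparent, which must agree) is an assumption of this example.

```python
import random

T = 5  # segregations defining the great-grandparent / great-grandchild relationship


def is_ibd(state):
    """I = 1 when the descent line is followed and both copies of the
    great-grandparent's gene are the same gene."""
    s1, s2, s3, s4, s5 = state
    return s1 == 1 and s2 == 1 and s3 == 1 and s4 == s5


def simulate_ibd_fraction(length, rng):
    """Simulate S(z) on [0, length] Morgans; return the fraction of the chromosome IBD."""
    state = [rng.randint(0, 1) for _ in range(T)]
    z, ibd_length = 0.0, 0.0
    while z < length:
        step = rng.expovariate(T)        # T indicators, each switching at rate 1 per Morgan
        end = min(z + step, length)
        if is_ibd(state):
            ibd_length += end - z
        z = end
        state[rng.randrange(T)] ^= 1     # jump chain: flip one indicator chosen uniformly
    return ibd_length / length
```

Averaged over many simulated 10 Morgan chromosomes, the IBD fraction should be close to the kinship coefficient 1/16, while individual chromosomes vary widely around it.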

As an example, we consider Monte Carlo log-likelihoods for first cousin and for great-grandparent relationships computed as outlined above. These two relationships both have kinship coefficient 1/16, so are not distinguishable on the basis of data only on the proportion of loci at which the segregating gametes are IBD. Data on IBD patterns at linked loci, or on the process I(z), do distinguish the relationships, in principle.
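The equality of the two kinship coefficients can be checked with the standard path-counting rule for non-inbred relatives, in which each path through a common ancestor contributes (1/2) raised to the number of segregations on the path plus one. The function name `kinship_from_paths` and the way the paths are listed are illustrative choices, not notation from the paper.

```python
from fractions import Fraction


def kinship_from_paths(path_meioses):
    """Kinship coefficient of two relatives by path counting.

    path_meioses lists, for each distinct path joining the two individuals
    through a common ancestor, the number of segregations on that path.
    Each path contributes (1/2) ** (meioses + 1); the common ancestors are
    assumed to be non-inbred.
    """
    return sum(Fraction(1, 2) ** (m + 1) for m in path_meioses)


# First cousins: one path of 4 segregations through each shared grandparent.
first_cousins = kinship_from_paths([4, 4])
# Great-grandparent and great-grandchild: a single line of 3 segregations.
great_grandparent = kinship_from_paths([3])
print(first_cousins, great_grandparent)
```

Both calls return the same fraction, which is why the proportion of IBD loci alone cannot separate the two relationships.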

First, data were simulated, consisting of 50 pairs of gametes each of length 10 Morgans from a first cousin relationship, and 50 pairs from a great-grandparent relationship. Then for each data chromosome, Monte Carlo estimates of the log-likelihood of a first cousin and of a great-grandparent relationship were obtained; these log-likelihoods are plotted against each other in Figure 2. Sufficient Monte Carlo was performed that the Monte Carlo standard errors on the points in Figure 2 are negligible.
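A simulation of this kind of data can be sketched as follows. The sketch is not the code used for Figure 2. It assumes crossovers without interference at rate 1 per Morgan (Haldane, 1919), descent through fathers, and a hypothetical labelling of the 5 and 8 segregations involved; all function names are illustrative, and more pairs are simulated than the 50 used in the text so that the averages are stable.

```python
import random
from bisect import bisect_right


def simulate_segregation(length, rng):
    """One indicator S_i(z) on [0, length] Morgans: a random starting value
    and crossover points from a rate-1 Poisson process (no interference)."""
    start = rng.randint(0, 1)
    points, z = [], rng.expovariate(1.0)
    while z < length:
        points.append(z)
        z += rng.expovariate(1.0)
    return start, points


def value_at(seg, z):
    """Value of the indicator at position z (each crossover flips it)."""
    start, points = seg
    return (start + bisect_right(points, z)) % 2


def ibd_great_grandparent(v):
    # v[0]: gamete from A; v[1]: A->B; v[2]: B->C; v[3]: C->D; v[4]: gamete from D
    return v[2] == v[3] == v[4] == 1 and v[0] == v[1]


def ibd_first_cousins(v):
    # v[0], v[1]: grandfather -> the two sibling parents
    # v[2], v[3]: grandmother -> the two sibling parents
    # v[4], v[5]: sibling parents -> cousins X and Y
    # v[6], v[7]: gametes sampled from X and Y
    if not (v[6] == v[7] == 1 and v[4] == v[5]):
        return False
    return v[0] == v[1] if v[4] == 1 else v[2] == v[3]


def simulate_ibd_segments(n_seg, ibd_rule, length, rng):
    """Return the IBD segments [(start, end), ...] of one realization of I(z)."""
    segs = [simulate_segregation(length, rng) for _ in range(n_seg)]
    cuts = sorted({0.0, length, *(p for _, pts in segs for p in pts)})
    out = []
    for left, right in zip(cuts, cuts[1:]):
        mid = 0.5 * (left + right)
        if ibd_rule([value_at(s, mid) for s in segs]):
            if out and out[-1][1] == left:
                out[-1] = (out[-1][0], right)  # IBD continues across this crossover
            else:
                out.append((left, right))
    return out


rng = random.Random(0)
LENGTH, N_PAIRS = 10.0, 1000  # 10 Morgans as in the text; N_PAIRS chosen for this sketch
summary = {}
for name, n_seg, rule in [("great-grandparent", 5, ibd_great_grandparent),
                          ("first cousins", 8, ibd_first_cousins)]:
    runs = [simulate_ibd_segments(n_seg, rule, LENGTH, rng) for _ in range(N_PAIRS)]
    ibd_length = sum(b - a for run in runs for a, b in run)
    # (proportion of genome IBD, mean number of IBD segments per gamete pair)
    summary[name] = (ibd_length / (LENGTH * N_PAIRS),
                     sum(len(run) for run in runs) / N_PAIRS)
    print(name, summary[name])
```

Under these assumptions the proportion of genome IBD should be close to 1/16 for both relationships, while first cousins should show more, and shorter, IBD segments: it is this difference in pattern, not in proportion, that the likelihoods exploit.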

[Figure 2: scatter plot titled "Chromosome Log-Likelihoods". Horizontal axis: log likelihood of great-grandparent relationship, with ticks from -10 to 2. Legend: + data from great-grandparents pedigree; o data from cousins pedigree.]

Figure 2. Plot of the Monte Carlo log-likelihoods of great-grandparent and first-cousin relationships based on simulated IBD data between gametes. Each point is based on a 10 Morgans length of gamete. For further details, see text. Figure provided by Sharon Guy.

We see there is some information, although very little. Data chromosomes deriving from a first cousin relationship tend to have a slightly higher log-likelihood of being from that relationship, while those deriving from a great-grandparent relationship tend to have a higher log-likelihood for being great-grandparent and great-grandchild rather than first cousins. However, quite a number of points have higher likelihood for the “wrong” relationship; if discrimination were attempted on the basis of each 10 Morgan length of data, there would be many errors. Combining the data, however, the outcome is clear; 500 Morgans of data is more than enough to distinguish these relationships. In fact, about 200 Morgans seems to be what would be required for reliable inference (Guy, personal communication); unfortunately, the human genome is only about 30 Morgans long, and other mammalian genomes are of comparable length.
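Combining the data amounts to adding log-likelihoods, since the gamete pairs are simulated independently. As a sketch in notation introduced here for illustration only (the index $j$ and the symbols $\ell_j$, $R_1$, $R_2$ do not appear in the original), the combined log-likelihood ratio for two candidate relationships $R_1$ and $R_2$ over the 50 pairs from one pedigree is

$$\log \frac{L(R_1)}{L(R_2)} = \sum_{j=1}^{50} \left[ \ell_j(R_1) - \ell_j(R_2) \right],$$

where $\ell_j(R)$ is the Monte Carlo log-likelihood of relationship $R$ for the $j$-th 10-Morgan gamete pair. Small per-chromosome differences of inconsistent sign can therefore accumulate into a clear overall difference across 500 Morgans.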

8 Conclusion

Segregation indicators provide the most basic description of gene descent and hence of the observable consequences of genealogical relationship. The independence structure of single-locus segregation indicators leads to computational algorithms for gene descent and gene ancestry, and thence to probabilities of gene identity by descent. Conversely, frequencies over independently segregating loci of different patterns of gene identity are the identifiable parameters of genealogical relationship. However, not all genealogical relationships are identifiable on the basis of data at independently segregating loci. Even where identifiability obtains, the information per locus is very small. There are insufficient independently segregating loci in the genome for precise estimation.

Thus analysis of data at linked loci is required, and again the segregation indicators provide a basic description of the process, linkage of loci along a chromosome being reflected in the Markov dependence of the corresponding segregation indicators. The dependence structures, both of genes segregating from parents to offspring in a pedigree and of segregation indicators along a chromosome, lend themselves to Markov chain Monte Carlo, leading to Monte Carlo likelihoods for inference of gene ancestry or for estimation of genetic parameters.

The information in patterns of gene identity at linked loci is bounded by the information in patterns of genome sharing. Here Markov chain Monte Carlo does not seem practical, but other forms of Monte Carlo likelihood can address the question. It is found that information about relationship in individual genomes is limited. This information is limited not by current limits on molecular biotechnology or computation, but by the finite length of genomes. It might be considered that if the differences between relationships in terms of patterns of genome sharing are so small, these differences must be unimportant, but this is not necessarily so. A comparison might be made with genetic selection, where effects of differential survival or fertility far too small to be measured on a generation-to-generation basis can have enormous impact on an evolutionary time scale. So also the small effects of different patterns of genome sharing, resulting from different population structures, may have large long-term effects.

Segregation indicators underlie the realized patterns of IBD in a pedigree. By analysis of the segregation process, whether by direct probability arguments, by MCMC, or by other forms of Monte Carlo using realizations of the process, we can analyze the IBD patterns at multiple loci, or even on a genomic basis. New computational and analysis ideas enable us to obtain likelihoods for genealogical relationships on the basis of genetic data, and hence obtain a greater understanding of the genetic consequences of genealogical relationship.

Acknowledgment

This paper is based on work presented at the Fourth World Congress of the Bernoulli Society, Vienna, August 1996. Research supported in part by NIH grant GM-46255. I am grateful to Sharon Guy for figure 2 and for many discussions.

References

Geyer, C.J. & Thompson, E.A. (1995). Annealing Markov chain Monte Carlo with applications to ancestral inference. Journal of the American Statistical Association, 90, 909-920.

Guy, S. (1997). The information in genomic IBD data. Journal of Computational Biology: in press.

Haldane, J.B.S. (1919). The combination of linkage values and the calculation of distances between the loci of linked factors. Journal of Genetics, 8, 299-309.

Karigl, G. (1981). A recursive algorithm for the calculation of gene identity coefficients. Annals of Human Genetics, 45, 299-305.

Mendel, G. (1866). Experiments in Plant Hybridisation. Mendel’s original paper in English translation, with a commentary by R.A. Fisher: Edinburgh: Oliver & Boyd, 1965.

Thompson, E.A. (1974). Gene identities and multiple relationships. Biometrics, 30, 667-680.

Thompson, E.A. (1976). A restriction of the space of genetic relationships. Annals of Human Genetics, 40, 201-204.

Thompson, E.A. (1988). Two-locus and three-locus gene identity by descent in pedigrees. IMA Journal of Mathematics Applied in Medicine & Biology, 5, 261-280.

Thompson, E.A. & Morgan, K. (1989). Recursive descent probabilities for rare recessive lethals. Annals of Human Genetics, 53, 357-374.

Thompson, E.A. & Guo, S.W. (1991). Evaluation of likelihood ratios for complex genetic models. IMA Journal of Mathematics Applied in Medicine & Biology, 8, 149-169.

Wright, S. (1922). Coefficients of inbreeding and relationship. American Naturalist, 56, 330-338.

Résumé

Segregation indicators provide a fundamental description of the outcomes of Mendelian segregation. This description facilitates algorithms for computing the probabilities of gene identity patterns among the individuals of a specified genealogical structure. These gene identity patterns determine in turn the patterns of similarity in observable traits among members of the same family. Conversely, gene identity patterns are the parameters of genealogical relationship that can be estimated from trait data on individuals. Where the data to be analyzed come from a large number of individuals or from a large number of genetic loci, Markov chain Monte Carlo realizations of the underlying Mendelian segregation indicators can be used to provide Monte Carlo estimates of likelihoods for gene ancestry or of genetic model parameters. The genealogical information in genetic data is bounded by the information in patterns of genomic identity by descent. Another form of Monte Carlo likelihood can be used to assess this information.

[Received July 1997, accepted July 1997]