

doi: 10.1046/j.1529-8817.2003.00054.x

Accelerated Gene Counting for Haplotype Frequency Estimation

Alun Thomas

Department of Medical Informatics and the Center for High Performance Computing, University of Utah, 391 Chipeta Way, Suite D, Salt Lake City, UT 84108, USA

Summary

Current implementations of the EM algorithm for estimating haplotype frequencies from genotypes on proximal loci require computational resources that grow as $n h_k^2$, where $n$ is the number of individuals genotyped and $h_k$ is the number of haplotypes possible on $k$ loci. For diallelic loci $h_k = 2^k$. We present an approach whose computational requirement grows as $n 2^t$, where $t$ is the largest number of loci at which an individual in the sample is heterozygous. The method is illustrated by haplotype frequency estimation from a sample of 45 individuals genotyped at 26 single nucleotide polymorphisms in the PIK3R1 gene.

Keywords: EM algorithm, association mapping, gene counting, computational efficiency

Introduction

The gene counting method devised by Ceppellini, Siniscalco & Smith (1955) solved the problem of obtaining maximum likelihood estimates, MLEs, for allele frequencies when the alleles are not expressed codominantly. In the canonical example, frequencies for the alleles of the ABO blood group system are obtained despite O being recessive to both A and B. This is done by allocating the count of individuals with blood type A partially to each of the genotypes AA and AO, with the share being determined by an initial guess at the relative frequencies of the A and O alleles. Given the inferred genotype counts it is straightforward to get MLEs of allele frequencies. With these new allele frequency estimates the count for type A individuals is reallocated, which leads to new allele frequency estimates, and so on. Type B individuals are treated similarly. By iterating through this process we obtain MLEs of the allele frequencies given the incomplete information provided by the blood group phenotypes. Dempster, Laird & Rubin (1977) provided considerable generalisation of this method to what is now the well known EM algorithm. Even without this generalisation, however, it is clear that gene counting can be applied to the problem of estimating haplotype frequencies from observations of genotypes at proximal loci, and this has been done by several authors (Excoffier & Slatkin, 1995; Hawley & Kidd, 1995; Long, Williams & Urbanek, 1995). Other developments have included parsimonious methods (Clark, 1990), and a Bayesian approach (Stephens, Smith & Donnelly, 2001; Niu et al. 2002).

Correspondence: Alun Thomas, E-mail: [email protected]

The problem with the EM method is that the computational requirements of current implementations grow exponentially as the number of loci grows. For instance, if we are considering $k$ diallelic loci there are $2^k$ possible haplotypes, $2^{k-1}(2^k + 1)$ possible unordered haplotype pairs, and $3^k$ possible sets of genotype observations. In some cases, for example a very large sample and a few loci, all haplotypes might be observed, or at least cannot be shown not to occur given the incomplete information provided by genotypes. However, even with moderately sized samples and a moderate number of loci it will be clear that many haplotypes do not accumulate any of the partial counts derived from the genotype counts. The method presented here makes use of this by building up the set of haplotypes that have non-zero counts and then carrying out the EM algorithm using only these haplotypes, and the haplotype pairs and genotype lists derived from them.
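To give a feel for how quickly these counts grow, the following short sketch evaluates the three expressions for a given number of diallelic loci. The function name `search_space_sizes` is introduced here purely for illustration and is not part of the author's program.

```python
def search_space_sizes(k):
    """Sizes of the unrestricted search space for k diallelic loci."""
    haplotypes = 2 ** k
    # Unordered pairs, allowing a haplotype to pair with itself:
    # h(h + 1)/2 with h = 2^k, which equals 2^(k-1) (2^k + 1).
    pairs = haplotypes * (haplotypes + 1) // 2
    genotype_lists = 3 ** k
    return haplotypes, pairs, genotype_lists


for k in (2, 4, 26):
    print(k, search_space_sizes(k))
```

For the 26 SNPs of the real example later in the paper, the first entry alone is 67,108,864 possible haplotypes.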

Annals of Human Genetics (2003) 67, 608–612. © University College London 2003


In most cases the number of haplotypes required is very small compared with the number of all possible haplotypes. Clearly, the total number of haplotypes that can actually exist in a sample cannot be more than twice the number of individuals and will typically be very much smaller. Even with the extra haplotypes that have to be considered due to incomplete information, the total that are needed grows far slower than the total possible.

The basis of the method is this. Consider the number of possible haplotypes for two linked loci. This is equal to the product of the number of alleles at the two loci. However, if in some particular sample of genotypes at these loci, an allele at either locus happens to be absent we can, while we are dealing with this particular sample, eliminate it from consideration at the locus and also from consideration in the haplotype. Thus, the number of haplotypes we have to deal with is reduced. Our method exploits this observation by breaking up the required haplotypes into sub haplotypes.

This method is not a parsimonious one. We are not limiting ourselves to some minimal set of haplotypes required to explain the genotypes. Nor is it an approximation to the EM method. Rather, it is an efficient way of detecting and eliminating at an early stage all haplotypes whose MLE of frequency is exactly zero.

A Java program that implements the method is available from the author.

Method

We will build up our haplotypes by considering increasing sets of markers.

The alleles, or haplotypes, across a marker set are in general, in the absence of data on relatives, unobservable because we cannot observe the phase of the component markers. Given two sets of markers $l_1$ and $l_2$ whose intersection is empty, the set of possible haplotypes on $l_3 = l_1 \cup l_2$ will be the product of the sets of haplotypes on $l_1$ and $l_2$. The set of haplotypes actually present on $l_3$ will be a subset of this product. The haplotypes on $l_1$ or $l_2$ are sub haplotypes of those on $l_3$. Those on $l_3$ are super haplotypes of those on $l_1$ and $l_2$.

Our problem is to infer the frequencies of haplotypes on a set of $k$, say, markers, given genotypic information at those markers.

Our method is as follows.

1. Initialise a set of marker sets $L$ to contain $k$ sets, each containing a different one of the original $k$ markers.

2. Arbitrarily choose any two marker sets $l_1$ and $l_2$ from $L$ and remove them from $L$.

3. Create a new set $l = l_1 \cup l_2$ whose haplotypes are the product of those on $l_1$ and $l_2$.

4. Run one round of the EM algorithm on $l$, considering only the genotypes for markers that are in $l$. Make the initial haplotype frequencies uniform.

5. Remove from the set of haplotypes on $l$ any whose current estimate of frequency is identically zero.

6. If $L$ is not empty, add $l$ to $L$ and iterate from 2 above.

7. If $L$ is empty, $l$ contains all the markers, and we have found all the haplotypes whose frequency MLE is not identically zero. Run EM on $l$ to convergence to obtain frequency MLEs.
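The seven steps can be sketched in a few lines of Python. This is an illustrative reading of the procedure, not the author's Java program: the names `em_round` and `accelerated_gene_counting` are hypothetical, markers are assumed to be diallelic with each genotype coded as the number of copies of allele 1, missing data are not handled, and the final stage runs a fixed number of iterations instead of testing for convergence. Compatible haplotype pairs are found by brute force over the surviving haplotypes, which keeps the code short but does not achieve the $n 2^t$ cost discussed in the Summary.

```python
from itertools import product


def em_round(genotypes, freq):
    """One gene counting (EM) iteration over the haplotypes in `freq`.

    Each genotype is a tuple of allele-1 counts (0, 1 or 2), one per marker;
    each haplotype is a tuple of 0/1 alleles in the same marker order.
    """
    haps = list(freq)
    counts = dict.fromkeys(haps, 0.0)
    for g in genotypes:
        # Ordered haplotype pairs whose allele sums reproduce the genotype.
        pairs = [(a, b) for a in haps for b in haps
                 if all(x + y == c for x, y, c in zip(a, b, g))]
        total = sum(freq[a] * freq[b] for a, b in pairs)
        if total == 0.0:
            raise ValueError(f"genotype {g} cannot be explained")
        # Share this individual's two haplotypes among the compatible pairs.
        for a, b in pairs:
            w = freq[a] * freq[b] / total
            counts[a] += w
            counts[b] += w
    n = 2 * len(genotypes)
    return {h: c / n for h, c in counts.items()}


def accelerated_gene_counting(genotypes, n_iter=2000):
    k = len(genotypes[0])
    # Step 1: one marker set per marker, carrying both alleles.
    L = [((m,), {(0,): 0.5, (1,): 0.5}) for m in range(k)]
    while len(L) > 1:
        # Step 2: take two marker sets (here simply the first two).
        (m1, f1), (m2, f2) = L.pop(0), L.pop(0)
        # Step 3: candidates are the product of the surviving sub haplotypes.
        markers = m1 + m2
        haps = [a + b for a, b in product(f1, f2)]
        # Step 4: one EM round from uniform frequencies on these markers only.
        sub = [tuple(g[m] for m in markers) for g in genotypes]
        freq = em_round(sub, dict.fromkeys(haps, 1.0 / len(haps)))
        # Step 5: drop haplotypes whose estimate is exactly zero.
        freq = {h: f for h, f in freq.items() if f > 0.0}
        # Step 6: return the merged set to L.
        L.append((markers, freq))
    # Step 7: iterate EM on the full marker set.
    markers, freq = L[0]
    sub = [tuple(g[m] for m in markers) for g in genotypes]
    for _ in range(n_iter):
        freq = em_round(sub, freq)
    # Report haplotypes in the original marker order.
    order = sorted(range(k), key=markers.__getitem__)
    return {tuple(h[i] for i in order): f for h, f in freq.items()}
```

Calling `accelerated_gene_counting` on a list of genotype tuples returns a dictionary from surviving haplotypes to estimated frequencies. The choice in step 2 is arbitrary in the method; taking the first two sets is just one convenient option.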

Since the haplotypes removed at stage 5 have zero MLE, any super haplotype that contains that set of alleles at those markers will also have zero MLE. Thus, the set of haplotypes of $l$ that we end up with will be independent of the order in which the loci were incorporated and will contain all haplotypes for which we would have obtained a non zero MLE if we had run EM on the full set of markers with all possible haplotypes.

Loci can be incorporated in several ways and this can be done independently of the chromosomal order. The programs available from the author incorporate markers one at a time in the order specified in the input file. If a divide and conquer approach is used, recursively splitting the list of markers approximately in half at each stage, this is similar to the partition ligation method of Niu et al. (2002); however, their implementation of EM is restricted to 15 polymorphic loci. As will be seen below, this method can handle far more loci.

Examples

A Trivial Illustration

Consider 4 markers, each with alleles 0 and 1. Also consider a sample of size 4 that has the observed genotype lists given in Table 1, where genotypes 0, 1 and 2 correspond to the number of occurrences of the allele 1 at each locus.


Table 1 Raw data

l1   l2   l3   l4
0    1    2    0
1    0    2    0
1    0    1    0
0    2    0    1

Table 2 Initial pairings

l5 = l1 ∪ l2 (markers 1 and 2)      l6 = l3 ∪ l4 (markers 3 and 4)
Haplotype   Count                   Haplotype   Count
(0, 0)      3                       (0, 0)      2
(0, 1)      3                       (0, 1)      1
(1, 0)      2                       (1, 0)      5
(1, 1)      0                       (1, 1)      0

Table 3 The full set of markers, l7 = l5 ∪ l6

               Frequency estimate after gene counting iteration
Haplotype      1        10         100        1000
(0, 0, 0, 0)   0.0625   3.15E-5    3.61E-48   0
(0, 0, 0, 1)   0
(0, 0, 1, 0)   0.3125   0.374968   0.375      0.375
(0, 1, 0, 0)   0.125    0.125      0.125      0.125
(0, 1, 0, 1)   0.125    0.125      0.125      0.125
(0, 1, 1, 0)   0.125    0.125      0.125      0.125
(1, 0, 0, 0)   0.0625   0.124968   0.125      0.125
(1, 0, 0, 1)   0
(1, 0, 1, 0)   0.1875   0.125032   0.125      0.125

Pairing up the first two markers and the last two markers we find the frequencies for the sub haplotypes after one round of EM given in Table 2.

Now, combining the two results we have to consider only the 9 haplotypes made from the product of {(0, 0), (0, 1), (1, 0)} × {(0, 0), (0, 1), (1, 0)} instead of the full 16 haplotypes possible. After one round of EM we get the results in Table 3, column 2.
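The pruning and product step for this example can be written out directly. The snippet below is only an illustration using the counts from Table 2; the helper name `survivors` is introduced here and does not come from the author's program.

```python
from itertools import product

# Counts from Table 2 after one EM round on each pair of markers.
counts_12 = {(0, 0): 3, (0, 1): 3, (1, 0): 2, (1, 1): 0}
counts_34 = {(0, 0): 2, (0, 1): 1, (1, 0): 5, (1, 1): 0}


def survivors(counts):
    """Sub haplotypes that received a non-zero count."""
    return [h for h, c in counts.items() if c > 0]


# Candidate haplotypes on all four markers: product of the survivors.
candidates = [a + b for a, b in product(survivors(counts_12),
                                        survivors(counts_34))]
print(len(candidates))  # 9, instead of the full 16
```

Any haplotype carrying (1, 1) at markers 1 and 2, or at markers 3 and 4, never enters the candidate list.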

Before we proceed with gene counting we first eliminate the haplotypes (0, 0, 0, 1) and (1, 0, 0, 1) since they have zero estimated frequency and we have reduced the original 16 to 7.

Columns 3, 4 and 5 of Table 3 show the frequency estimates after 10, 100 and 1000 gene counting iterations respectively. As can be seen the MLE for the frequency of haplotype (0, 0, 0, 0) goes to zero, but convergence is rather slow. This illustrates that our method discards only a subset of those haplotypes whose MLE is zero.
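The rate of this convergence can be worked out by hand for the example. The following is a sketch derived here from Tables 1 and 3, not a calculation given in the original text. After the elimination step, only the third individual in Table 1, with genotype list (1, 0, 1, 0), is ambiguous: it is explained either by the pair (0, 0, 0, 0) with (1, 0, 1, 0) or by the pair (0, 0, 1, 0) with (1, 0, 0, 0). Let $p_n$ (a symbol introduced for this sketch) be the share of that individual allocated to the first pair in iteration $n$, so that $p_1 = 1/2$ under uniform starting frequencies. With 8 haplotypes in the sample, the estimates after iteration $n$ are

$$
f_{0000} = \frac{p_n}{8}, \qquad f_{1010} = \frac{1 + p_n}{8}, \qquad f_{0010} = \frac{3 - p_n}{8}, \qquad f_{1000} = \frac{1 - p_n}{8},
$$

and the share used in the next iteration is

$$
p_{n+1} = \frac{f_{0000} f_{1010}}{f_{0000} f_{1010} + f_{0010} f_{1000}} = \frac{p_n (1 + p_n)}{p_n (1 + p_n) + (3 - p_n)(1 - p_n)} \approx \frac{p_n}{3} \quad \text{for small } p_n .
$$

Under this reading the estimate for (0, 0, 0, 0) shrinks by only a factor of about 3 per iteration. This is consistent with Table 3: the value $3.15 \times 10^{-5}$ at iteration 10, multiplied by $3^{-90} \approx 1.1 \times 10^{-43}$, gives roughly $3.6 \times 10^{-48}$ at iteration 100. The estimate therefore never becomes exactly zero in exact arithmetic, which is why step 5 of the method does not remove this haplotype.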

A Real Example

The program implementing this method is written in Java, version 1.2.2. It takes its input in the form of standard Linkage parameter and pedigree data files, although the pointer information, sex and proband status of individuals is ignored. The formats for these files can be accessed on the internet at http://linkage.rockefeller.edu/soft/linkage/. The current implementation accepts only markers coded as Linkage numbered allele loci; however, as all codominant markers, including SNPs, can be coded this way, this is not a severe restriction.

The program combines markers one at a time in the order listed in the input file which will usually, but not necessarily, correspond to chromosomal order. At each stage we eliminate unnecessary haplotypes and finally run 2000 EM or gene counting iterations.

For illustration, the program was run on a data set of 45 Caucasian individuals genotyped at 26 polymorphic SNPs in the PIK3R1 gene (phosphoinositide-3-kinase, regulatory subunit, polypeptide 1 (p85 alpha)). The distance from the first SNP to the last is about 71 kilobases. Of the 45 individuals, 1 had data missing at 1 marker and 5 had data missing at 3 markers. The genotypes were ascertained by sequencing of complete exons and of introns near intron-exon boundaries. This data was provided by Myriad Genetics Inc.

Table 4 gives the MLEs for the haplotypes with estimated frequency of greater than 1%. In all there were 343 haplotypes that had to be considered in the final gene counting. The complete analysis took under 2 seconds of CPU on a SUN Blade 1000 workstation with two 750 MHz processors, although the program made use of only 1 processor.

There is a common wild type haplotype in this population and for several markers only a single individual had the non wild type allele.

Table 4 Haplotypes with MLE of greater than 1% estimated from a sample of 45 Caucasian individuals typed at 26 SNPs in the PIK3R1 gene

Haplotype                    Frequency MLE (%)
CCAGGCTGATAATATCCAGTAAGGCA   38.07
CCAAGCTGATAATATCCAGTAAGGCA   8.01
CCAGGCTGATAATATCCAGTAAGGCG   5.85
CCAGACTGATAATATCCAGTAAGGCA   5.84
CTCGGCTGATAATATCCAGTAAGGCA   4.92
CTCAGCTGATAATATCCAGTAAGTCA   4.17
CTCAGCTGATAATATCCAGTAAAGCA   2.90
CTCGGCTGATAATATCCAGTAAAGCA   1.62
CTCGACTGATAATATCCAGTAAGGCA   1.46
CTCGGCTGATAGTATCCAGTAAGGCA   1.11
CTCGGCTGATAATATCCGGTAAGGCA   1.11
CTCGGCTGATAATACCCAGTAAGGCG   1.11
CTCAGCTGATAATATCCAGTAAGGCG   1.11
CTCAGCTGATAATATCCAGTAAATCG   1.11
CTCAGCTGATAATATCCAGGAAGGCG   1.11
CTCAGCTGACAATATCCAGTAAGGCA   1.11
CTCAACTGATAATATCCAGTAGGGCA   1.11
CCAGGTTGATAATATCCAGTAAGGCG   1.11
CCAGGCTGATGATATCCAGTAAGGCG   1.11
CCAGGCTGATAGTATTCAGTAAGGCA   1.11
CCAGGCTGATAATATCTAGTAAGGCA   1.11
CCAGGCTGATAATATCCAATAAGGCG   1.11
CCAGGCGGATAATATCCAGTAAGGCG   1.11
CCAGACTGATAATGTCCAGTAAGGCA   1.11
CCAGACTGATAATATCCAGTAAGGTG   1.11
CCAGACTAATAATATCCGGTAAGGCA   1.11
CCAAGCTGATAATATCCAGTGAGGCA   1.11
CCAAGCTGATAACATCCAGTAAGTCA   1.11
CCAAACTGATAATATCCAGTAAGTTA   1.11
CCAAACTGATAATATCCAGTAAGGTA   1.11

Discussion

The method presented here inherits all the features that make EM attractive for this problem. We can, for example, find the haplotype pairs that best match a given genotype set, and from the weight which was allocated to that pair have some measure of confidence in that reconstruction. Also, we can allow for missing data, such as drop outs from the genotyping process, by allowing any pair of alleles present in the sample to match the missing observation. This, however, does have the drawback of greatly decreasing the number of haplotypes that can be eliminated if the amount of missing data is large. Moderate amounts of missing data, as in the example above, are not a problem.
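As a rough illustration of both points, the sketch below picks the most probable haplotype pair for one individual and reports the weight allocated to it. It assumes diallelic markers with genotypes coded as the number of copies of allele 1, and uses `None` for a missing observation; the function name `best_pair` and this coding are assumptions made for the example, not details of the author's program.

```python
def best_pair(genotype, freq):
    """Most probable unordered haplotype pair for one genotype list.

    `freq` maps haplotypes (tuples of 0/1 alleles) to estimated frequencies.
    A genotype entry of None is treated as missing and matches any alleles.
    Returns the pair and the share of the individual allocated to it.
    """
    haps = sorted(freq)
    weights = {}
    for i, a in enumerate(haps):
        for b in haps[i:]:
            if all(g is None or x + y == g
                   for x, y, g in zip(a, b, genotype)):
                # A pair of distinct haplotypes can arise in two orders.
                weights[(a, b)] = freq[a] * freq[b] * (1 if a == b else 2)
    total = sum(weights.values())
    pair = max(weights, key=weights.get)  # fails if no pair is compatible
    return pair, weights[pair] / total


# Estimates after the first iteration in Table 3, restricted to the
# four haplotypes relevant to the genotype list (1, 0, 1, 0).
freq = {(0, 0, 0, 0): 0.0625, (1, 0, 1, 0): 0.1875,
        (0, 0, 1, 0): 0.3125, (1, 0, 0, 0): 0.0625}
print(best_pair((1, 0, 1, 0), freq))
```

With these early estimates the pair (0, 0, 1, 0) with (1, 0, 0, 0) receives a weight of 0.625. A genotype list with `None` entries is compatible with more pairs, which reflects the cost of missing data described above.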

There are limitations also. Generally convergence of the EM algorithm can be slow, although this has, in our experience, not been a problem in this context. Also, analyses that use the whole likelihood, not just the maximum, will not benefit from the acceleration. The method makes EM faster, not more appropriate, and there are occasions when other approaches such as those of Stephens et al. (2001) or Niu et al. (2002) will be preferred.

In this example the SNPs are all in the same gene, there is strong disequilibrium between the markers, and the quality of genotyping is high with the level of missing data low. This sort of data is ideal for this computational method and it has been used in datasets with far more individuals and over 100 markers (Alexander Gutin, personal communication). A large amount of missing data can not only slow the program down, but can lead to numerical instability so that the results can become dependent on the order in which the markers are incorporated. However, the data used here is typical of that used in fine mapping.

As the number and density of genetic markers, especially SNPs, grows so does the potential for fine mapping of genetic traits provided we can handle this increase in information. The method presented here greatly increases the tractability of such datasets.

Acknowledgements

I would like to acknowledge the help of Alexander Gutin, Victor Abkevich and Mark Skolnick, my former colleagues at Myriad Genetics Inc. where this method was originally developed, and to thank the company for allowing me the use of the illustrative data set.

References

Ceppellini, R., Siniscalco, M. & Smith, C.A.B. (1955) The estimation of gene frequencies in a random-mating population, Annals of Human Genetics 20, 97–115.

Clark, A.G. (1990) Inference of haplotypes from PCR-amplified samples of diploid populations, Molecular Biology and Evolution 7, 111–122.

Dempster, A.P., Laird, N.M. & Rubin, D.B. (1977) Maximum likelihood estimation from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B 39, 1–38.

Excoffier, L. & Slatkin, M. (1995) Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population, Molecular Biology and Evolution 12, 921–927.

Hawley, M. & Kidd, K. (1995) HAPLO: a program using the EM algorithm to estimate the frequencies of multi-site haplotypes, Journal of Heredity 86, 772–789.


Long, J.C., Williams, R.C. & Urbanek, M. (1995) An E-M algorithm and testing strategy for multiple locus haplotypes, American Journal of Human Genetics 56, 799–810.

Niu, T., Qin, Z.S., Xu, X. & Liu, J.S. (2002) Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms, American Journal of Human Genetics 70, 157–169.

Stephens, M., Smith, N.J. & Donnelly, P. (2001) A new statistical method for haplotype reconstruction from population data, American Journal of Human Genetics 68, 978–989.

Received: 15 April 2002

Accepted: 30 October 2002
