Estimation of HLA-A, -B, -DRB1 Haplotype Frequencies Using Mixed Resolution Data from a National...


Human Immunology (2007) 68, 950–958

Estimation of HLA-A, -B, -DRB1 Haplotype Frequencies Using Mixed Resolution Data from a National Registry with Selective Retyping of Volunteers

Craig Kollman a,*, Martin Maiers b, Loren Gragert b, Carlheinz Müller c, Michelle Setterholm b, Machteld Oudshoorn d, Carolyn Katovich Hurley e

a Jaeb Center for Health Research, Tampa, Florida
b National Marrow Donor Program, Minneapolis, Minnesota
c ZKRD - German National Bone Marrow Donor Registry, Ulm, Germany
d Department of Immunohematology and Bloodtransfusion, Leiden University Medical Center and Europdonor Foundation, Leiden, The Netherlands
e Department of Oncology, Georgetown University Medical Center, Washington, DC

Received 13 December 2005; received in revised form 16 September 2007; accepted 5 October 2007

Summary

Large registries of volunteer hematopoietic stem cell donors typed for HLA contain potentially valuable data for studying haplotype frequencies in the general population. However, the usual assumptions for use of the expectation-maximization (EM) algorithm are typically violated in these registries. To avoid this problem, previous studies using registry data have reduced the HLA typings to low resolution and/or excluded subjects who were selected for testing on behalf of a specific patient (patient-directed typings). These restrictions, added to avoid bias from selection of nonrepresentative volunteers for higher-resolution typing, have limited previous results to haplotypes defined at low resolution. In this article we eliminate the need for such restrictions by formally relaxing the assumptions necessary for the EM algorithm. We show mathematically and through simulation that varying levels of resolution can be incorporated even if the level of typing resolution is chosen based on the HLA type. This allows use of intermediate- and high-resolution data from patient-directed typings to extend haplotype frequency estimates to the allele level for HLA-DRB1. We demonstrate the feasibility of using this computationally demanding algorithm on large datasets by applying it to more than 3 million volunteers listed in the National Marrow Donor Program Registry.

© 2007 American Society for Histocompatibility and Immunogenetics. Published by Elsevier Inc. All rights reserved.

KEYWORDS: EM algorithm; HLA-DRB1; Haplotype frequencies

* Corresponding author. Fax: (813) 975-8761. E-mail address: [email protected] (C. Kollman).

0198-8859/$ - see front matter © 2007 American Society for Histocompatibility and Immunogenetics. Published by Elsevier Inc. All rights reserved. doi:10.1016/j.humimm.2007.10.009

Introduction

The National Marrow Donor Program (NMDP) was established by Congress in 1986 to develop and maintain a registry of human leukocyte antigen (HLA)-typed volunteer unrelated donors for patients in need of a hematopoietic stem cell transplant [1,2]. Since then more than 5 million volunteers have been recruited and typed for HLA polymorphisms, with a primary focus on HLA-A, -B, and -DRB1. The NMDP Registry therefore offers a valuable resource for studying allele and haplotype frequencies of these three loci.

There are limitations to the Registry HLA data, however, that must be taken into consideration. Typing was done using a variety of techniques at varying levels of resolution (Table 1) as the technology evolved over the 20-year history of the NMDP [3–6]. Many of the intermediate- and high-resolution typings of volunteer donors were performed because they were potential matches for a patient (patient-directed typings). Volunteers with common HLA types may therefore be more likely to have allele level data available, so that the level of resolution is not statistically independent of HLA type.

Previous analyses of HLA frequencies in the NMDP Registry have used a version of the expectation-maximization (EM) algorithm [7,8] that requires all subjects to be tested for all three loci (HLA-A, -B, -DRB1) at the same level of resolution [9–12]. Analyses therefore excluded volunteers who were not tested for all three loci (introducing a potential selection bias), and DRB1 alleles where known were collapsed to serologic HLA-DR antigens in these previous studies.

Adaptations of the EM algorithm to incorporate varying levels of typing resolution and untyped loci have been previously described [13–15], but they have assumed (some implicitly) that the level of typing resolution is statistically independent of the HLA type. In this article we formally relax that assumption, specifying conditions under which the level of typing resolution can depend on the HLA type and still satisfy the general framework of the EM algorithm. This justifies the use of mixed resolution data that include patient-directed typings to extend haplotype frequency estimates to the allele level.

We use data from more than 3 million volunteer donors listed in the NMDP Registry to estimate the frequencies of HLA-A, -B, -DRB1 haplotypes in various racial/ethnic groups. A simulation is also presented to confirm that the EM algorithm gives valid results even when volunteers with more common HLA types are disproportionately selected for intermediate- or high-resolution typing.

ABBREVIATIONS

EM: expectation-maximization
HLA: human leukocyte antigen
NMDP: National Marrow Donor Program

Subjects and methods

Subjects

Analysis included all domestic volunteers recruited into the NMDP Registry from January 1993 through July 2004 who self-reported their race/ethnicity as African-American/black, Asian/Pacific Islander, Hispanic, or Caucasian/white. Other groups such as Native American/Alaska Natives were not analyzed here because of the relatively smaller sample sizes. All volunteers in the racial/ethnic groups indicated above were included in the analysis regardless of the level of resolution of HLA data available.

HLA Typing

All volunteer donors were typed for HLA-A and -B when they joined the NMDP Registry. Most volunteers were typed using serologic methods until 1997, when an initiative to use DNA-based testing was implemented. Starting in 1992 many new volunteers were also typed for HLA-DRB1 at the time of recruitment.

Table 1. Examples of assignments used to describe results of HLA testings at varying levels of resolution.

| Assignment | Allele(s) possibly present | Comment |
|---|---|---|
| DRB1*0101 | 0101 | Allele resolution: Specific allele is determined from HLA testing. |
| DRB1*01AB (a) | 0101 or 0102 | Intermediate resolution: HLA testing cannot distinguish between these two alleles, but rules out all other possibilities. |
| DRB1*01BV | 0101, 0102, 0106 or 0107 | Intermediate resolution: Less specific testing gives a larger list of possible alleles. |
| DRB1*01XX | 0101, 0102, 0103, 0104, 0105, 0106, 0107, 0108, . . . or 0116 | Low resolution: DNA testing is logically equivalent to a serologic typing of DR1 (see next row). |
| DR1 | 0101, 0102, 0103, 0104, 0105, 0106, 0107, 0108, . . . or 0116 | Low resolution: Serologic testing identifies the protein expressed on the surface of the cell; it could be any one of the 16 DRB1 alleles that encode the DR1 protein. |

Note: HLA typing does not always identify the specific allele(s) present in an individual. Results are used to narrow the list of alleles possibly present in the individual. Each of the assignments listed in this table is consistent with the allele DRB1*0101. An individual with this allele could be reported with any of these assignments depending on the level of resolution used.
(a) The full set of allele letter codes is available online at http://bioinformatics.nmdp.org/HLA/allele_code_lists.html. These codes are approved for use in registry typing by the World Marrow Donor Association [6].
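The assignments in Table 1 can be read as sets of candidate alleles, which is also how the EM algorithm later treats a typing result. The Python sketch below encodes the five rows of the table this way. The names `DR1_ALLELES`, `ASSIGNMENTS`, `is_consistent`, and `ambiguity` are illustrative and not part of any NMDP software, and the elided range in the table is filled in under the assumption that the DR1 family is exactly the 16 alleles 0101 through 0116 mentioned in the table's comment.

```python
# Illustrative only: Table 1 assignments as sets of candidate DRB1 alleles.
# Assumption: the DR1 family is the 16 alleles 0101..0116 implied by Table 1.
DR1_ALLELES = frozenset(f"01{n:02d}" for n in range(1, 17))

ASSIGNMENTS = {
    "DRB1*0101": frozenset({"0101"}),                          # allele resolution
    "DRB1*01AB": frozenset({"0101", "0102"}),                  # intermediate resolution
    "DRB1*01BV": frozenset({"0101", "0102", "0106", "0107"}),  # intermediate resolution
    "DRB1*01XX": DR1_ALLELES,                                  # low resolution (DNA)
    "DR1": DR1_ALLELES,                                        # low resolution (serology)
}


def is_consistent(assignment: str, allele: str) -> bool:
    """Return True if the reported assignment does not rule out the allele."""
    return allele in ASSIGNMENTS[assignment]


def ambiguity(assignment: str) -> int:
    """Return the number of alleles still possible under the assignment."""
    return len(ASSIGNMENTS[assignment])


if __name__ == "__main__":
    # Every row of Table 1 is consistent with DRB1*0101; only the ambiguity differs.
    for code in ASSIGNMENTS:
        print(code, ambiguity(code), is_consistent(code, "0101"))
```

Representing each result as a set makes "higher resolution" a simple subset relation: the alleles allowed by DRB1*01AB are a subset of those allowed by DRB1*01BV, which in turn are a subset of those allowed by DRB1*01XX.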

HLA-DRB1 typing at the time of recruitment used molecular biology techniques. Over time the HLA DNA-based testing has incorporated additional reagents, and the resolution has increased from low to intermediate (Table 1).

Subsequent higher-resolution testing of a selected subset of the volunteer donors was obtained in several scenarios. Transplant centers requested blood samples from volunteers who they believed may be a potential match for their patient based on the HLA assignments listed in the NMDP Registry (patient-directed typings). These samples were then retyped by the transplant center, choosing the level of resolution at their discretion subject to minimum requirements set by the NMDP [16]. HLA typing results were reported back to the NMDP Coordinating Center (regardless of whether the volunteer was considered a suitable match for the patient), where a detailed override algorithm was used to update the information in the Registry. Other volunteers were typed at high resolution as part of allele frequency studies or because they carried a novel HLA assignment.

In summary, some volunteer donor HLA assignments were obtained by serology at the broad or split level and others by DNA-based testing at low, intermediate, or allele resolution. Because volunteers with higher-resolution testing are often selected as described above, this subset is not necessarily representative of the general population (i.e., is not random).

The level of HLA-DRB1 data available varied according to race/ethnicity (Table 2). Overall 25% of volunteers were tested for only HLA-A and -B, but more minority individuals have been tested for HLA-DRB1 because these volunteers were given higher priority. HLA-DRB1 alleles were determined for 7% of volunteers, ranging from 6% for individuals of Asian/Pacific Islander and Caucasian ethnicities to 9% for those of African-American ethnicity.

Table 3 lists the HLA antigens and alleles that are present within the HLA assignments in the Registry. Serologic subdivisions of B14 and B70 (i.e., splits) were converted to their broad antigens for analysis because these antigens were most often reported at this level. Because of the unreliability of antibody reagents for A69, serologic A68 and A69 results were converted to the broad A28. For the analysis described here, cases where HLA-A and/or HLA-B alleles were known were collapsed to a low-resolution serologic split equivalent [17,18]. The serologic types assigned to alleles are available online (http://bioinformatics.nmdp.org, listed under the headings HLA Resources and NMDP Search Determinants).

All 250 HLA-DRB1 alleles present in the NMDP Registry were used in analysis (Table 3). The NMDP has developed an extensive set of allele codes to allow HLA testing laboratories to report results of low- and intermediate-resolution typings [18]. This allows the list of all possible alleles present in a volunteer donor to be stored in the computerized Registry. For example, an HLA laboratory will report the code DRB1*01AB to denote that the individual possesses either DRB1*0101 or DRB1*0102 (but it is unknown which one). The alleles included in the code are based on the HLA alleles known at the time of the typing test interpretation. More examples are given in Table 1.

The accuracy of the HLA assignments in the Registry has been evaluated over time. The laboratories performing the testing were required to be accredited, and the majority of testing incorporated blinded quality control samples. For DRB1 the accuracy of the quality control samples tested at the time of recruitment has improved from an overall error rate of 1.6% in 1992 [19] to 0.06% today (data not shown). Also the accuracy of HLA-A, -B serologic typing compared with DNA-based assignments [20] and a comparison of recruitment typing to higher-resolution typing of volunteers potentially matching a searching patient have been reported [21].

Table 2. Level of resolution available for volunteers in the National Marrow Donor Program (NMDP) Registry at the HLA-DRB1 locus.(a,b)

| Level of DRB1 typing | African-American/black (N = 412,861) | Asian/Pacific Islander (N = 359,423) | Hispanic (N = 449,844) | Caucasian/white (N = 2,361,208) | Total (N = 3,583,336) |
|---|---|---|---|---|---|
| None | 37,503 (9%) | 40,601 (11%) | 47,771 (11%) | 773,172 (33%) | 899,047 (25%) |
| Serology (c) | 1,600 (1%) | 2,853 (1%) | 2,466 (1%) | 68,379 (3%) | 75,298 (2%) |
| Low-resolution DNA | 53,015 (13%) | 44,561 (12%) | 47,518 (11%) | 171,538 (7%) | 316,632 (9%) |
| Intermediate-resolution DNA | 283,053 (69%) | 251,433 (70%) | 322,175 (72%) | 1,196,807 (51%) | 2,053,468 (57%) |
| Allele(s) determined | 37,690 (9%) | 19,975 (6%) | 29,914 (7%) | 151,312 (6%) | 238,891 (7%) |

(a) Numbers do not include 1,259,743 volunteers whose race/ethnicity is unknown and 228,507 volunteers from other racial/ethnic groups with sample sizes too small for analysis.
(b) Each volunteer is classified according to the lowest resolution for the two DRB1 alleles.
(c) Serologic testing is considered low resolution.

Estimation of Haplotype Frequencies

The EM algorithm [7] is a very general principle for handling missing data in statistical analysis. As applied to estimation of multilocus haplotype frequencies, this algorithm has been described in detail elsewhere [8] and is briefly summarized here (also see Appendix for more details). This procedure involves two steps at each iteration and is repeated until convergence of the haplotype frequency estimates is obtained.

Step 1: expectation (E-step). The current haplotype frequency estimates are used to calculate for each subject the conditional probability of each possible genotype (pair of haplotypes) given the observed HLA data available. These conditional probabilities are then used to update the current estimates of genotype frequencies.

Step 2: maximization (M-step). The updated genotype frequencies are used to update the current estimates of haplotype frequencies by inverting the equation given in Assumption A1 in the Appendix.

For the above procedure to be a valid application of the EM algorithm, some mathematical assumptions are required about how the data were obtained. When all subjects are typed at the same level of resolution, or when the level of typing resolution is statistically independent of the HLA type, it is relatively straightforward to show that this procedure satisfies the general framework of the EM algorithm. However, these assumptions would preclude the use of allele-level data from patient-directed typings, as volunteers are not chosen at random. In the Appendix we relax these assumptions, specifying conditions under which the level of typing resolution may depend on the HLA type without violating the EM principle.
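The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' program: each subject is represented by the set of genotypes consistent with the typing result, haplotypes are plain integer indices, and genotype frequencies are linked to haplotype frequencies through Hardy-Weinberg proportions. Treating Hardy-Weinberg proportions as the relationship used in the M-step is an assumption of this sketch, as are the function names `genotype_freq` and `em_haplotype_frequencies`.

```python
def genotype_freq(h, i, j):
    """Hardy-Weinberg frequency of the unordered genotype (i, j) (assumed link)."""
    return h[i] * h[j] if i == j else 2.0 * h[i] * h[j]


def em_haplotype_frequencies(subjects, n_haplotypes, tol=1e-10, max_iter=1000):
    """Estimate haplotype frequencies from ambiguous typings.

    subjects: list of sets of (i, j) pairs with i <= j, the genotypes
              consistent with each subject's observed typing result.
    """
    h = [1.0 / n_haplotypes] * n_haplotypes  # uniform starting point
    for _ in range(max_iter):
        counts = [0.0] * n_haplotypes
        for candidates in subjects:
            # E-step: conditional probability of each candidate genotype
            # given the typing result and the current frequency estimates.
            weights = {g: genotype_freq(h, *g) for g in candidates}
            total = sum(weights.values())
            for (i, j), w in weights.items():
                p = w / total
                counts[i] += p
                counts[j] += p
        # M-step: each subject carries two haplotypes, so divide the
        # expected haplotype counts by twice the number of subjects.
        new_h = [c / (2.0 * len(subjects)) for c in counts]
        if max(abs(a - b) for a, b in zip(h, new_h)) < tol:
            return new_h
        h = new_h
    return h


if __name__ == "__main__":
    # Toy data: six subjects whose typing cannot separate haplotypes 0 and 1,
    # and four subjects known to be homozygous for haplotype 2.
    ambiguous = frozenset({(0, 0), (0, 1), (1, 1)})
    resolved = frozenset({(2, 2)})
    print(em_haplotype_frequencies([ambiguous] * 6 + [resolved] * 4, 3))
```

In the toy input the typing never distinguishes haplotypes 0 and 1, so only their combined frequency is determined by the data and the uniform starting point leaves it split evenly between them. A production implementation would also need to enumerate the candidate genotypes from the typing codes, which is where most of the computational cost lies.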

To account for variations by race/ethnicity [9], haplotype frequency estimates were calculated separately for African-Americans/blacks, Asian/Pacific Islanders, Hispanics, and Caucasians/whites based on each volunteer's self-reported racial/ethnic group at the time of recruitment.

Simulation

To demonstrate the validity of the EM algorithm when volunteers are selectively chosen for higher-resolution typing based on their HLA type, we simulated a virtual registry of volunteer donors. The true haplotype frequencies in this simulated population were taken from the estimated frequencies for Caucasian individuals (chosen because of the larger sample size). Two million volunteers were simulated by randomly selecting two independent haplotypes from the true distribution.

Simulated volunteers initially had their HLA types mapped to low resolution (serologic equivalent). Volunteers were then chosen for intermediate-resolution typing (see example codes in Table 1) with probability proportional to their true low-resolution phenotype frequency. The intermediate-resolution results were generated by simulating the diploid probe pattern and resulting allele ambiguities that would be generated by standard oligo-based HLA typing kits [22]. A subset of these simulated volunteers was then chosen for further high-resolution typing (i.e., DRB1 allele(s) determined) with probability proportional to their intermediate-level resolution typing frequency. The resulting registry therefore contained a mixture of low- (60%), intermediate- (24%), and high-resolution (16%) typings, with the more common HLA types more likely to have intermediate- or high-resolution data available.

The EM algorithm as described above was run on this simulated registry, and the resulting haplotype frequency estimates were compared with the true frequencies.

Table 3. HLA-A, -B antigens and DRB1 alleles used in the analysis.(a,b)

HLA-A: 1, 2, 3, 11, 23 (9), 24 (9), 25 (10), 26 (10), 29 (19), 30 (19), 31 (19), 32 (19), 33 (19), 34 (10), 36, 43, 66 (10), 68 (28), 69 (28), 74 (19), 80

HLA-B: 7, 8, 13, 14, 18, 27, 35, 37, 38 (16), 39 (16), 41, 42, 44 (12), 45 (12), 46, 47, 48, 49 (21), 50 (21), 51 (5), 52 (5), 53, 54 (22), 55 (22), 56 (22), 57 (17), 58 (17), 59, 60 (40), 61 (40), 62 (15), 63 (15), 67, 70, 73, 75 (15), 76 (15), 77 (15), 78, 81, 82, 83

| HLA-DR | HLA-DRB1 alleles present in NMDP Registry (c) |
|---|---|
| 1 | 0101–0108 (8 alleles) |
| 3 | 0301–0308; 0310–0317; 0320 (17 alleles) |
| 4 | 0401–0419; 0422–0427; 0431–0436; 0438; 0440–0441; 0443–0444; 0450 (37 alleles) |
| 7 | 0701; 0703–0705; 0707 (5 alleles) |
| 8 | 0801–0814; 0817–0818; 0820, 0822; 0824; 0826 (20 alleles) |
| 9 | 0901–0902 (2 alleles) |
| 10 | 1001 (1 allele) |
| 11 (5) | 1101–1134; 1136–1141; 1143; 1145; 1147–1148; 1151; 1153–1154 (47 alleles) |
| 12 (5) | 1201–1208 (8 alleles) |
| 13 (6) | 1301–1329; 1331–1332; 1335–1343; 1347; 1349, 1351–1352; 1356; 1359; 1361–1362 (48 alleles) |
| 14 (6) | 1401–1429; 1432–1435; 1439; 1441–1442; 1444; 1447–1448 (39 alleles) |
| 15 (2) | 1501–1507; 1509–1512 (11 alleles) |
| 16 (2) | 1601–1605; 1607–1608 (7 alleles) |

Note: There were 21 HLA-A antigens, 42 HLA-B antigens, and 250 HLA-DRB1 alleles from 13 different HLA-DR antigen groups.
(a) Numbers in parentheses denote broad antigens. For example, a typing result of HLA-A10 denotes the presence of A25, A26, A34, or A66, but it is unknown which one. Serologic subdivisions of B14 and B70 (i.e., splits) were converted to their broad antigens for analysis because these antigens were most often reported at this level. A68 and A69 were similarly converted to A28 because of the unreliability of antibody reagents for A69.
(b) More details of HLA nomenclature are available at www.anthonynolan.org.uk/hig.
(c) Only those alleles that encode different amino acid sequences are included. Each DRB1 allele listed was identified in at least one volunteer by high-resolution typing results, but many are most commonly listed within intermediate-resolution typing results (i.e., included within an allele code).

Results

Simulated Data

Despite the fact that simulated volunteers with more common HLA types were more likely to have high-resolution level data available, the EM algorithm did not overestimate the frequencies of the common haplotypes. There were 178 haplotypes in the simulation with a true frequency of greater than 1 in 1,000 (common haplotypes), combining to comprise 56.4% of the overall population. Because of the selective retyping, these 178 haplotypes were overrepresented among simulated volunteers with high-resolution typing.
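The overrepresentation follows directly from selecting volunteers with probability proportional to the frequency of their own type. The Python sketch below works this out analytically for a made-up population; the frequencies are hypothetical and are not the Registry estimates, `share_among_selected` is an illustrative helper, and the selection is simplified to a single step on a generic "type" rather than the two-stage phenotype-based scheme used in the simulation.

```python
def share_among_selected(freqs, common):
    """Share of `common` types among subjects selected with probability
    proportional to the population frequency of their own type."""
    # P(type = t and selected) is proportional to f_t * f_t.
    weights = {t: f * f for t, f in freqs.items()}
    total = sum(weights.values())
    return sum(weights[t] for t in common) / total


# Hypothetical population: two common types and 100 equally rare ones.
freqs = {"c1": 0.30, "c2": 0.26}
freqs.update({f"r{k}": 0.44 / 100 for k in range(100)})
common = {"c1", "c2"}

# Share of the common types in the whole population (0.56 by construction).
population_share = sum(freqs[t] for t in common)
# Share of the common types among the selectively retyped subgroup.
selected_share = share_among_selected(freqs, common)

print(round(population_share, 3), round(selected_share, 3))
```

The point of the sketch is only that size-biased selection inflates the share of common types in the retyped subgroup far above their population share, which is why a naive tally restricted to high-resolution typings would be biased.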

The 178 common haplotypes comprised 90% of the subgroup of simulated volunteers with high-resolution typing. This overrepresentation did not bias the results of the EM algorithm, which gave an estimate of 56.6% for the sum of these 178 haplotype frequencies. The lack of bias for individual haplotypes is shown in Figure 1.

Figure 1. Simulated registry with selective retyping of volunteers. Common haplotypes with a frequency greater than 1 in 1,000 made up 56% of the simulated population and 90% of the subgroup chosen for high-resolution typing. Despite this overrepresentation, there was no systematic bias for the expectation-maximization (EM) algorithm to overestimate the frequencies of the common haplotypes. Diagonal represents the line of identity. Note that both axes are on a reverse logarithmic scale. Figure excludes less common haplotypes where evaluation of any bias is confounded by the larger statistical error (variance).

Actual Data

The complete list of HLA-A, -B, -DRB1 haplotype frequency estimates by race/ethnicity is available online (http://bioinformatics.nmdp.org/em-haplotype). Table 4 shows the 10 most common haplotype frequencies in each racial/ethnic group. Although some haplotypes were common to multiple groups, there were considerable differences according to self-reported race/ethnicity. A1-B8-DRB1*0301 was the most common haplotype in Caucasians and the second most common in both Hispanics and African-Americans, with estimated frequencies of 1 in 16, 1 in 60, and 1 in 84, respectively. This haplotype was the 36th most common in Asian/Pacific Islanders with an estimated frequency of 1 in 336. A3-B7-DRB1*1501 was also common in Caucasians (1 in 33), Hispanics (1 in 86), and African-Americans (1 in 157) but was less so in Asian/Pacific Islanders (1 in 463). A33-B58-DRB1*0301 was the most common haplotype in Asian/Pacific Islanders, with an estimated frequency of 1 in 52, but occurred in less than 1 in 2,000 haplotypes in each of the other three groups.

Table 4. Most common HLA-A, -B, -DRB1 haplotypes according to self-reported racial/ethnic group.(a) Each cell lists A, B, DRB1, and the estimated frequency.

| Rank | African-American/black | Asian/Pacific Islander | Hispanic | Caucasian/white |
|---|---|---|---|---|
| 1 | 30, 42, 0302, 1 in 70 | 33, 58, 0301, 1 in 52 | 29, 44, 0701, 1 in 54 | 1, 8, 0301, 1 in 16 |
| 2 | 1, 8, 0301, 1 in 84 | 33, 44, 0701, 1 in 62 | 1, 8, 0301, 1 in 60 | 3, 7, 1501, 1 in 33 |
| 3 | 33, 53, 0804, 1 in 138 | 2, 46, 0901, 1 in 67 | 2, 35, 0802, 1 in 64 | 2, 44, 0401, 1 in 48 |
| 4 | 68, 58, 1201, 1 in 139 | 24, 52, 1502, 1 in 82 | 3, 7, 1501, 1 in 86 | 2, 7, 1501, 1 in 50 |
| 5 | 3, 7, 1501, 1 in 157 | 33, 44, 1302, 1 in 87 | 68, 39, 0407, 1 in 113 | 29, 44, 0701, 1 in 64 |
| 6 | 36, 53, 1101, 1 in 163 | 30, 13, 0701, 1 in 88 | 2, 39, 0407, 1 in 127 | 2, 62, 0401, 1 in 81 |
| 7 | 34, 44, 1503, 1 in 182 | 33, 58, 1302, 1 in 89 | 33, 14, 0102, 1 in 129 | 1, 57, 0701, 1 in 89 |
| 8 | 2, 44, 0401, 1 in 194 | 1, 57, 0701, 1 in 94 | 24, 39, 1406, 1 in 139 | 3, 35, 0101, 1 in 89 |
| 9 | 30, 42, 0804, 1 in 204 | 11, 75, 1202, 1 in 106 | 30, 18, 0301, 1 in 144 | 2, 8, 0301, 1 in 112 |
| 10 | 68, 70, 0301, 1 in 205 | 24, 7, 0101, 1 in 123 | 2, 35, 0407, 1 in 152 | 2, 60, 1302, 1 in 123 |

(a) A full list of haplotype frequency estimates by race/ethnicity is available online at http://bioinformatics.nmdp.org/em-haplotype

There was also considerable variation in overall HLA polymorphism among the racial/ethnic categories. The median haplotype frequency (the value at which there is a 50% chance that a randomly selected haplotype would have a greater frequency) was approximately 1 in 2,100 for African-Americans, 1 in 1,400 for Hispanics, 1 in 1,100 for Asian/Pacific Islanders, and 1 in 950 for Caucasians. The distribution of haplotype frequencies by race/ethnicity is shown in Figure 2. These data show that the HLA-A, -B, -DRB1 polymorphism is highest among African-Americans and lowest among Caucasians, with Asian/Pacific Islanders and Hispanics falling intermediate.

Extending haplotype frequency estimates to the allele level for HLA-DRB1 reveals an additional polymorphism within the previously defined low-resolution HLA-A, -B, -DR (serologic) groups. One simple way to quantify this increased polymorphism is by the estimated probability that two randomly chosen haplotypes within the same HLA-A, -B, -DR group would have different DRB1 alleles. This probability ranged from 20% (Caucasians) to 28% (Hispanics), with intermediate values for Asian/Pacific Islanders (25%) and African-Americans (27%). The increased allelic polymorphism was not universal, however, across all haplotypes. Within the low-resolution level haplotype A1-B8-DR3, for example, the DRB1*0301 allele comprised more than 99% of the total probability mass in each racial/ethnic group.

Although there were 250 HLA-DRB1 alleles identified in the NMDP Registry (Table 3), only 81 had an allele frequency (collapsed over HLA-A and -B) greater than 1 in 10,000 in any racial/ethnic group.
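Both summaries used in this section, allele frequencies collapsed over HLA-A and -B and the probability that two haplotypes from one low-resolution group carry different DRB1 alleles, are simple functions of the haplotype frequency table. The Python sketch below shows one way to compute them. The frequencies are hypothetical and cover only a fragment of a population, the function names are illustrative, and membership in an HLA-DR group is approximated by the first two digits of the allele name, which is an assumption of this sketch rather than the serologic mapping used in the analysis.

```python
from collections import defaultdict

# Hypothetical haplotype frequencies keyed by (A, B, DRB1 allele); not Registry data.
hap_freqs = {
    ("1", "8", "0301"): 0.0597,
    ("1", "8", "0304"): 0.0003,
    ("3", "7", "1501"): 0.0250,
    ("3", "7", "1502"): 0.0050,
}


def drb1_allele_frequencies(freqs):
    """Collapse haplotype frequencies over HLA-A and -B."""
    out = defaultdict(float)
    for (_a, _b, drb1), f in freqs.items():
        out[drb1] += f
    return dict(out)


def prob_different_drb1(freqs, a, b, dr_prefix):
    """Probability that two haplotypes drawn independently from one
    A-B-DR group carry different DRB1 alleles."""
    group = [f for (ha, hb, d), f in freqs.items()
             if ha == a and hb == b and d.startswith(dr_prefix)]
    total = sum(group)
    # One minus the chance that both draws give the same allele.
    return 1.0 - sum((f / total) ** 2 for f in group)


allele_freqs = drb1_allele_frequencies(hap_freqs)
# Number of alleles above a frequency threshold, here 1 in 10,000.
n_above = sum(1 for f in allele_freqs.values() if f > 1e-4)

print(allele_freqs, n_above)
print(prob_different_drb1(hap_freqs, "1", "8", "03"))
print(prob_different_drb1(hap_freqs, "3", "7", "15"))
```

In this made-up fragment the first group is dominated by a single allele, so the probability of drawing two different alleles is below 1%, while the second group is more evenly split and gives a value near 28%. How the per-group values are weighted into one figure per racial/ethnic group is not spelled out in this section, so the sketch stops at the per-group quantity.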

Only 49 HLA-DRB1 alleles had an allele frequency greater than 1 in 1,000.

Discussion

We have shown, both mathematically and by simulation, that the EM algorithm can be applied to a dataset of mixed resolution HLA typings, even when the level of resolution depends on the HLA type. We intentionally exaggerated the oversampling of common haplotypes for high-resolution typing in our simulation to demonstrate that this does not skew the results of the EM algorithm. The 178 haplotypes with a frequency greater than 1 in 1,000 comprised 57% of the simulated population but represented 90% of the haplotypes among individuals selected for high-resolution typing. Despite this oversampling, the EM algorithm did not overestimate the frequencies of these haplotypes (Figure 1).

Algorithms allowing for varying levels of resolution have been previously described [13–15], but the level of resolution was assumed to be independent of the HLA type. Previous results have therefore been limited to haplotypes defined at low resolution. Relaxing this assumption allows the use of allele-level HLA data available in volunteer registries, much of which is obtained through patient-directed typings.

We also show that this computationally demanding algorithm can be successfully implemented on large datasets, even with the greatly expanded list of HLA-DRB1 alleles. The HAPLO program, for example, is limited to a maximum of 114 haplotypes [13]. In our implementation there are 21 × 42 × 250 = 220,500 total haplotypes. At each iteration, the EM algorithm requires calculation of the frequency of every possible genotype for every individual. A single individual with the low-resolution result A10,19; B15,22 (DRB1 not tested), for example, has more than 9.5 million possible genotypes. Nevertheless our program processed a dataset of 3.5 million individuals in 5.5 hours running on a cluster of five Sun Fire V100 servers (550 MHz Ultra SPARC IIi, 2GB RAM).

This is the first study to report estimates of multi-locus HLA haplotype frequencies at the allele level for DRB1 from a dataset of this size. Previous reports have given frequency estimates for haplotypes defined at low resolution. Our data reveal the increased allelic polymorphism within low-resolution HLA-DR groups. This increased polymorphism is clinically important to identify the optimal stem cell donor [23,24].

Figure 2. HLA polymorphism by race/ethnicity. Height of each curve denotes the percentage of HLA-A, -B, -DRB1 haplotypes in the population with frequency less than the value on the horizontal axis. Note that the horizontal axis is on a reverse logarithmic scale.

One limitation of the EM algorithm is that it is not guaranteed to find the global maximum likelihood. The algorithm may settle on a locally maximum likelihood set of frequencies, which is different from the global maximum. The problem has been addressed by varying initial conditions [25,26], but a satisfactory general solution to this problem has yet to be established. Recent work has compared the EM algorithm to Bayesian methods for haplotype estimation [27–29].

HLA haplotype population frequency estimates are a valuable tool in the management of volunteer donor registries. They have been used to project how many volunteer donors would be needed to achieve a certain probability of finding an HLA-matched donor. These data have demonstrated the challenges in providing HLA-matched donors for minority patients and the special need for minority-focused recruitment in volunteer registries [9,30,31]. The utility of these tools could be greatly enhanced by using the work presented here to extend the definition of an HLA match to the allele level for DRB1.

Another useful application of these haplotype frequencies is to predict the probability that a volunteer typed at low or intermediate resolution would match a specific patient at high resolution. The initial search of a donor registry often reveals several potential matches for a patient. The ability to distinguish which of these volunteers are likely and which are unlikely to match the patient can be extremely important when resources and time to perform HLA testing are limited. Prediction tools at the low-resolution level have been previously described for unrelated [10] and related donors within the extended family [32].

Research has also shown that allele level mismatches at HLA-A, -B, and -C can affect transplantation outcomes [24,33,34]. Future research may discover other important loci. In principle, the algorithm presented here could be used to extend to the allele level for these loci as well. There is a practical limit, however, to how far haplotype frequency estimation can be expanded using this algorithm. The consideration of additional loci dramatically increases the number of possible haplotypes and therefore the amount of data necessary for reliable estimation.

Although this version of the EM algorithm reduces the necessary assumptions, there are still limitations that must be considered in the interpretation of these data. Serologic typing has a relatively high error rate, especially in minority populations [20]. Our results may therefore underestimate the incidence of frequently mistyped antigens (those with less reliable antisera).

s reliable antisera) and understate the level of polymorphism present. In addition, the HLA-DRB1 codes from earlier typings excluded alleles that were unknown at the time. These results may therefore underestimate the true frequencies of recently discovered alleles. Without going back to the original data of tested DNA polymorphisms [22], it is unclear how these codes could be reinterpreted without violating Assumption A2 in the Appendix. The fact that these alleles were only recently discovered suggests they may be rare in the general population, but this remains to be seen.

Another possible violation of Assumption A2 may occur if a typing laboratory reported allele codes in an inconsistent manner. If, for example, DRB1*0101 were always reported at the allele level, but any other allele in the DR1 family were always reported as DRB1*01XX (a code that includes DRB1*0101 as a possibility), the EM algorithm would overestimate the frequency of DRB1*0101. The reported code needs to accurately reflect which alleles have been excluded based on the probes or primers used.

The groupings used in this analysis were based on self-reported race/ethnicity collected at the time of recruitment, but the scientific validity of this categorization has been questioned [35,36]. The populations defined by these broad categories are not in exact Hardy-Weinberg equilibrium [11], but the EM algorithm has been shown to be robust under moderate deviations from this assumption [37]. Nonetheless it is a gross oversimplification to describe the variations of HLA among human beings in such broadly defined categories.

The presence of low- or intermediate-resolution typings in the dataset can result in some haplotype frequencies being unidentifiable. Suppose, for example, that all subjects who possess an A1 antigen, a B7 antigen, and a DRB1*0101 and/or 0102 allele were typed with probes that could not distinguish those two DRB1 alleles (e.g., coded as DRB1*01AB; see Table 1). Then it would not be possible to separate the frequencies of A1, B7, DRB1*0101 versus A1, B7, DRB1*0102. Only the sum of their frequencies would be identifiable. If the EM algorithm were initiated with the uniform distribution, then the final frequency estimates would divide this sum equally among the confounded haplotypes. This could lead to overestimation of the total diversity present in the population.

As would be the case even if haplotypes could be observed directly, the ability to accurately estimate frequencies of rare haplotypes is limited by the sample size of the dataset. The statistical margin of error (and therefore the ability to estimate rare haplotypes) when using the EM algorithm will also depend on the proportion of subjects typed at low versus intermediate versus high resolution, as well as the proportion who are typed only for a subset of the loci.
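The identifiability problem described above can be illustrated with a small sketch. It is not the registry implementation: the labels H1, H2, H3, the toy dataset of ten subjects, and the function name `em_haplotypes` are all assumptions made for illustration. H1 and H2 stand for two haplotypes that no typing in the dataset can tell apart, and H3 stands for everything else.

```python
def em_haplotypes(h0, typings, n_iter=100):
    """Iterate the EM update for haplotype frequencies.

    h0      -- dict: haplotype label -> starting frequency
    typings -- list of sets S_k of unordered genotypes (l, m) compatible
               with each subject's typing (assumed to have positive
               probability under the current estimates)
    """
    h = dict(h0)
    K = len(typings)
    for _ in range(n_iter):
        new_h = {hap: 0.0 for hap in h}
        for S_k in typings:
            # Hardy-Weinberg weight of each candidate genotype under current h
            w = {(l, m): (1.0 if l == m else 2.0) * h[l] * h[m] for (l, m) in S_k}
            total = sum(w.values())
            for (l, m), w_lm in w.items():
                share = w_lm / total / (2 * K)  # each subject holds 2 of 2K haplotypes
                new_h[l] += share
                new_h[m] += share
        h = new_h
    return h

# Hypothetical toy data: 4 subjects carry one confounded haplotype plus H3,
# 6 subjects are unambiguously H3/H3.
ambiguous = frozenset({("H1", "H3"), ("H2", "H3")})
resolved = frozenset({("H3", "H3")})
typings = [ambiguous] * 4 + [resolved] * 6

uniform = em_haplotypes({"H1": 1 / 3, "H2": 1 / 3, "H3": 1 / 3}, typings)
skewed = em_haplotypes({"H1": 0.5, "H2": 0.1, "H3": 0.4}, typings)

print(uniform)  # H1 and H2 share the confounded mass equally
print(skewed)   # same H1 + H2 total, but split 5:1 as in the starting values
```

In this toy case both runs agree that H1 and H2 together account for 20% of haplotypes, while the split between them simply reproduces the ratio in the starting values. That is the sense in which only the sum is identifiable.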

Despite these limitations, the algorithm described here can be used to approximate the level of polymorphism present in human populations. This has important implications for the ability of volunteer registries to rapidly identify HLA-matched, unrelated donors for patients in need of a transplant.

Acknowledgments

We are grateful to each volunteer who has registered to donate stem cells to offer a stranger a second chance at life. We thank Bonnie Olson and Chris Hansen for their assistance in the preparation of this manuscript.

Appendix: The expectation-maximization (EM) algorithm

Notation

Let $K$ be the number of HLA-typed subjects in the dataset and take $h_i$ to be the frequency of the $i$th haplotype in the general population. Let $(i, j)$ be the genotype comprised of haplotypes $i$ and $j$ and let $g_{ij}$ be its frequency in the general population. Take $G_k$ to be the genotype of the $k$th subject. $\Pr\{\cdot\}$ stands for the probability of the event denoted inside the brackets and $E\{\cdot\}$ denotes the expected value.

Let $G$ be the set of all possible genotypes in a population and let $P$ be a partition of $G$. That is, $P$ is a collection of mutually exclusive sets of genotypes, $\{S_1, \ldots, S_p\}$, whose union is $G$. For each subject in the dataset, a partition, $P_k$, is chosen and the results of the HLA typing tell us which one of the sets in that partition contains the subject's true genotype. Thus, we equate each HLA typing result with the set of genotypes, $S_k$, possibly present in the $k$th subject.

Assumptions

The level of typing resolution for the $k$th subject, $P_k$, may be random and can depend on the subject's genotype. Furthermore, the resolution chosen for one subject need not be independent of those for other subjects. The decision whether to retype a volunteer donor at higher resolution, for example, may depend on whether he/she is a potential match for a patient or whether there are other (potential) matches in the Registry. We only require that the decision whether to include a set of possible genotypes in the partition cannot depend on which of those genotypes is actually present.

More formally, we assume that:

A1. The $K$ genotypes are jointly independent and subjects are sampled at random from a population in Hardy-Weinberg equilibrium so that

$$g_{ij} = \Pr\{G_k = (i, j)\} = \begin{cases} 2 h_i h_j & \text{if } i \neq j \\ h_i^2 & \text{if } i = j \end{cases}$$

for all $k$.

A2. The joint probability distribution for the random levels of typing resolution, $\{P_k\}_{k=1}^{K}$, is such that for any collection of $K$ sets of genotypes, $\{S_k\}_{k=1}^{K}$, the conditional probability $\Pr\{S_k \in P_k \text{ for all } k \mid \{G_k\}_{k=1}^{K}\}$ is constant over each $S_k$.
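As a small numerical check of Assumption A1, the sketch below builds Hardy-Weinberg genotype frequencies from haplotype frequencies and then recovers the haplotype frequencies from them. The haplotype labels and frequencies are made up for illustration, and the function names are not from the paper.

```python
from itertools import combinations_with_replacement

def genotype_frequencies(h):
    """Hardy-Weinberg frequencies g_ij for unordered genotypes (i, j)."""
    g = {}
    for i, j in combinations_with_replacement(sorted(h), 2):
        g[(i, j)] = h[i] ** 2 if i == j else 2.0 * h[i] * h[j]
    return g

def haplotype_frequencies(g):
    """Invert the relation: h_i = g_ii + (1/2) * sum over j != i of g_ij."""
    h = {}
    for (i, j), freq in g.items():
        if i == j:
            h[i] = h.get(i, 0.0) + freq
        else:
            h[i] = h.get(i, 0.0) + 0.5 * freq
            h[j] = h.get(j, 0.0) + 0.5 * freq
    return h

# Hypothetical haplotype frequencies (must sum to 1)
h = {"hapA": 0.5, "hapB": 0.3, "hapC": 0.2}
g = genotype_frequencies(h)
h_back = haplotype_frequencies(g)

print(sum(g.values()))  # genotype frequencies sum to 1 (up to rounding)
print(h_back)           # matches h (up to rounding)
```

Genotypes are stored as unordered pairs, which is why the heterozygote frequency carries the factor of 2.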

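Derivation

From Assumption A1, if the genotypes (and therefore haplotypes) were directly observable, the log likelihood would be that for a multinomial distribution:

$$L(N, h) = \log \binom{2K}{N_1, \ldots, N_H} + \sum_i N_i \log(h_i)$$

where $N = \{N_i\}_{i=1}^{H}$ and $N_i$ is the number of occurrences of the $i$th haplotype in the dataset ($\sum_i N_i = 2K$).

Instead, we actually observe $S = \{S_k\}_{k=1}^{K}$, a list of possible genotypes for each subject. Suppose after the $n$th iteration we have the estimates $h^{(n)} = \{h_i^{(n)}\}_{i=1}^{H}$. The general form of the EM algorithm is to replace these estimates at the next iteration with the values $h^{(n+1)} = \{h_i^{(n+1)}\}_{i=1}^{H}$ that maximize $E\{L(N, h) \mid S, h^{(n)}\}$ over $h$ [7]. This is achieved by taking

$$h_i^{(n+1)} = \frac{1}{2K} E\{N_i \mid S, h^{(n)}\} = \frac{1}{2K} \sum_{k=1}^{K} \left[ 2 \Pr\{G_k = (i, i) \mid S, h^{(n)}\} + \sum_{j \neq i} \Pr\{G_k = (i, j) \mid S, h^{(n)}\} \right]$$

$$= \frac{1}{2K} \sum_{k=1}^{K} \left[ 2\, \frac{\Pr\{G_k = (i, i) \mid h^{(n)}\} \Pr\{S_l \in P_l \ \forall l \mid G_k = (i, i)\}}{\Pr\{G_k \in S_k \text{ and } S_l \in P_l \ \forall l \mid h^{(n)}\}} + \sum_{j \neq i} \frac{\Pr\{G_k = (i, j) \mid h^{(n)}\} \Pr\{S_l \in P_l \ \forall l \mid G_k = (i, j)\}}{\Pr\{G_k \in S_k \text{ and } S_l \in P_l \ \forall l \mid h^{(n)}\}} \right]$$

The conditional probabilities $\Pr\{G_k = (i, j) \mid S, h^{(n)}\}$ are clearly zero for $(i, j) \notin S_k$, and must therefore sum to 1 over $(i, j) \in S_k$. By Assumption A2, the quantity $\Pr\{S_l \in P_l \ \forall l \mid G_k = (i, j)\}$ is constant for all $(i, j) \in S_k$. The common value of the ratio

$$\frac{\Pr\{S_l \in P_l \ \forall l \mid G_k = (i, j)\}}{\Pr\{G_k \in S_k \text{ and } S_l \in P_l \ \forall l \mid h^{(n)}\}}$$

for $(i, j) \in S_k$ must therefore equal

$$\left[ \sum_{(l,m) \in S_k} \Pr\{G_k = (l, m) \mid h^{(n)}\} \right]^{-1}.$$

We therefore have:

$$h_i^{(n+1)} = \frac{1}{2K} \left[ \sum_{k:(i,i) \in S_k} \frac{2 \Pr\{G_k = (i, i) \mid h^{(n)}\}}{\sum_{(l,m) \in S_k} \Pr\{G_k = (l, m) \mid h^{(n)}\}} + \sum_{j \neq i} \sum_{k:(i,j) \in S_k} \frac{\Pr\{G_k = (i, j) \mid h^{(n)}\}}{\sum_{(l,m) \in S_k} \Pr\{G_k = (l, m) \mid h^{(n)}\}} \right]$$

$$= \frac{1}{2K} \left[ \sum_{k:(i,i) \in S_k} \frac{2 [h_i^{(n)}]^2}{\sum_{(l,m) \in S_k} \delta_{lm} h_l^{(n)} h_m^{(n)}} + \sum_{j \neq i} \sum_{k:(i,j) \in S_k} \frac{2 h_i^{(n)} h_j^{(n)}}{\sum_{(l,m) \in S_k} \delta_{lm} h_l^{(n)} h_m^{(n)}} \right]$$

$$= \frac{1}{K} \sum_{j} \sum_{k:(i,j) \in S_k} \frac{h_i^{(n)} h_j^{(n)}}{\sum_{(l,m) \in S_k} \delta_{lm} h_l^{(n)} h_m^{(n)}}$$

where

$$\delta_{lm} = \begin{cases} 1 & \text{if } l = m \\ 2 & \text{if } l \neq m. \end{cases}$$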
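The final line of the derivation can be sketched as a single update function. This is a minimal illustration under stated assumptions, not the code used on the registry: the name `em_step`, the dictionary representation of $h$, and the representation of each $S_k$ as a set of unordered haplotype pairs (each genotype listed once) are choices made here for clarity.

```python
def em_step(h, typings):
    """One EM update h^(n) -> h^(n+1) for haplotype frequencies.

    h       -- dict: haplotype label -> current frequency estimate h_i^(n)
    typings -- list with one entry per subject; entry k is the set S_k of
               unordered genotypes (l, m) compatible with that typing
    """
    K = len(typings)
    new_h = {hap: 0.0 for hap in h}
    for S_k in typings:
        # delta_lm * h_l * h_m: Hardy-Weinberg probability of each candidate
        weights = {}
        for (l, m) in S_k:
            delta = 1.0 if l == m else 2.0
            weights[(l, m)] = delta * h[l] * h[m]
        total = sum(weights.values())
        if total == 0.0:
            raise ValueError("typing has zero probability under current estimates")
        for (l, m), w in weights.items():
            # Posterior probability of genotype (l, m), spread over the
            # subject's two haplotypes out of 2K in the dataset
            share = w / total / (2 * K)
            new_h[l] += share
            new_h[m] += share
    return new_h

# Hypothetical example: one a/a homozygote and one unambiguous a/b heterozygote
h0 = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
h1 = em_step(h0, [{("a", "a")}, {("a", "b")}])
print(h1)  # a: 0.75, b: 0.25, c: 0.0 (direct haplotype counting)
```

When every $S_k$ contains a single genotype, the update reduces to counting haplotypes, which is a convenient sanity check. Enumerating each $S_k$ explicitly is only practical for small examples, since untyped loci make these sets very large.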

Remarks

1. For subjects who are heterozygous at and fully typed for all three loci, $S_k$ contains four possible genotypes. In contrast, if the subject is not typed for HLA-DRB1 and is heterozygous at both HLA-A and -B, then $S_k$ contains 125,000 possible genotypes.

2. As is often the case with the EM algorithm, this procedure has an intuitive interpretation. The current haplotype frequency estimates are used to calculate the conditional probability of a genotype $(i, j)$ given the observed HLA typing, $S_k$. The probability mass associated with each individual ($K^{-1}$) is allocated to each of the possible genotypes $(i, j)$ in proportion to this estimated conditional probability:

$$g_{ij}^{(n+1)} = \frac{1}{K} \sum_{k:(i,j) \in S_k} \frac{\delta_{ij} h_i^{(n)} h_j^{(n)}}{\sum_{(l,m) \in S_k} \delta_{lm} h_l^{(n)} h_m^{(n)}}.$$

The haplotype frequencies are then updated according to these new genotype frequencies by inverting the relation between haplotype and genotype frequencies in Assumption A1:

$$h_i^{(n+1)} = g_{ii}^{(n+1)} + \frac{1}{2} \sum_{j \neq i} g_{ij}^{(n+1)}.$$

Note that these two steps are equivalent to the formula in the final line of the above derivation.

3. Assumption A2 is a requirement of noninformative censoring and basically states that the event $S_k \in P_k$ is conditionally independent of $G_k$ given $G_k \in S_k$. This requires that reporting of any HLA codes be done in a consistent manner for each subject.
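The genotype counts in Remark 1 can be checked by brute-force enumeration of phase assignments. The sketch below is illustrative only: the function name, the allele labels, and the five-allele toy list for the untyped locus are assumptions, not values from the paper.

```python
from itertools import product

def compatible_genotypes(typing, alleles=None):
    """Enumerate unordered haplotype pairs compatible with an unphased typing.

    typing  -- list of (locus, pair); pair is the two observed alleles,
               or None if the locus was not typed
    alleles -- dict: locus -> candidate alleles, needed for untyped loci
    """
    alleles = alleles or {}
    per_locus = []
    for locus, pair in typing:
        if pair is None:
            # Untyped locus: any ordered pair of candidate alleles is possible
            cands = alleles[locus]
            per_locus.append([(a, b) for a in cands for b in cands])
        else:
            a, b = pair
            # Typed locus: the two alleles may sit on either haplotype
            per_locus.append([(a, b), (b, a)])
    genotypes = set()
    for assignment in product(*per_locus):
        hap1 = tuple(a for a, _ in assignment)
        hap2 = tuple(b for _, b in assignment)
        genotypes.add(tuple(sorted((hap1, hap2))))  # unordered pair
    return genotypes

# Heterozygous and fully typed at all three loci
full = [("A", ("A1", "A2")), ("B", ("B7", "B8")),
        ("DRB1", ("DRB1*0101", "DRB1*0301"))]
n_full = len(compatible_genotypes(full))

# Heterozygous at A and B, DRB1 untyped, with 5 hypothetical DRB1 alleles
partial = [("A", ("A1", "A2")), ("B", ("B7", "B8")), ("DRB1", None)]
toy_alleles = {"DRB1": ["d%d" % i for i in range(5)]}
n_partial = len(compatible_genotypes(partial, toy_alleles))

print(n_full, n_partial)  # 4 and 2 * 5**2 = 50
```

With $n$ candidate alleles at the untyped locus the count is $2n^2$, so the figure of 125,000 would correspond to about 250 DRB1 alleles. That allele count is an inference made here, not a number stated in the text.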

References

[1] McCullough J, Hansen J, Perkins H, Stroncek D, Bartsch G. The National Marrow Donor Program: How it works, accomplishments to date. Oncology 1989;3:63.

[2] Zumwalt Admiral ER, Howe CWS. The origins and development of the National Marrow Donor Program. Leukemia 1993;7:1122.

[3] Kimura A, Dong RP, Harada H, Sasazuki T. DNA typing of HLA class II genes in B-lymphoblastoid cell lines homozygous for HLA. Tissue Antigens 1992;40:5.

[4] Ng J, Hurley CK, Baxter-Lowe LA, Chopek M, Coppo PA, Hegland J, Kukuruga D, Monos D, Rosner G, Schmeckpeper B, Yang SY, Dupont B, Hartzman RJ. Large-scale oligonucleotide typing for HLA-DRB1/3/4 and HLA-DQB1 is highly accurate, specific, and reliable. Tissue Antigens 1993;42:473.

[5] Hurley CK, Maiers M, Ng J, Wagage D, Hegland J, Baisch J, et al. Large-scale DNA-based typing of HLA-A and HLA-B at low resolution is highly accurate, specific, and reliable. Tissue Antigens 2000;55:352.

[6] Bochtler W, Maiers M, Oudshoorn M, Marsh SGE, Raffoux C, Mueller C, Hurley CK. World Marrow Donor Association guidelines for use of HLA Nomenclature and its validation in the data exchange among hematopoietic stem cell donor registries and cord blood banks. Bone Marrow Transplant 2007;39:737.

[7] Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm [with discussion]. J R Stat Soc 1977;B39:1.

[8] Long JC, Williams RC, Urbanek M. An EM algorithm and testing strategies for multiple locus haplotypes. Am J Hum Genet 1995;56:799.

[9] Beatty P, Mori M, Milford E. Impact of racial genetic polymorphism upon the probability of finding an HLA-matched donor. Transplant 1995;60:778.

[10] Mori M, Graves M, Milford E, Beatty P. Computer program to predict likelihood of finding an HLA-matched donor: Methodology, validation, and application. Biol Blood Marrow Transplant 1996;2:133.

[11] Mori M, Beatty PG, Graves M, Boucher KM, Milford EL. HLA gene and haplotype frequencies in the North American population: The National Marrow Donor Program donor registry. Transplant 1997;64:1017.

[12] Boucher K, Mori M, Milford E, Beatty PG. Estimation of HLA-A, -B, -DR haplotype frequencies in five racial groups represented in the NMDP donor file. In: Gjertson W, Terasaki PI (eds). HLA 1998. Los Angeles, CA: UCLA Tissue Typing Laboratory 1998;57.

[13] Hawley ME, Kidd KK. HAPLO: A program using the EM algorithm to estimate the frequencies of multisite haplotypes. J Hered 1995;86:409.

[14] Müller CR, Ehninger G, Goldman SF. Gene and haplotype frequencies for the loci HLA-A, HLA-B, and HLA-DR based on over 13,000 German blood donors. Hum Immunol 2003;64:137.

[15] Gourraud PA, Lamiraux P, El-Kadhi N, Raffoux C, Cambon-Thomsen A. Inferred HLA haplotype information for donors from hematopoietic stem cells donor registries. Hum Immunol 2005;66:563.

[16] Hurley CK, Baxter-Lowe LA, Logan B, Karanes C, Anasetti C, Weisdorf D, Confer DL. National Marrow Donor Program HLA matching guidelines for unrelated marrow transplants. Biol Blood Marrow Transplant 2003;9:610.

[17] Schreuder GMT, Hurley CK, Marsh SGE, Lau M, Fernández-Viña M, Noreen HJ, et al. The HLA dictionary 2004: A summary of HLA-A, -B, -C, -DRB1/3/4/5, and -DQB1 alleles and their association with serologically defined HLA-A, -B, -C, -DR and -DQ antigens. Hum Immunol 2005;66:170.

[18] Hurley CK, Setterholm M, Lau M, Pollack MS, Noreen H, Howard A, et al. Hematopoietic stem cell donor registry strategies for assigning search determinants and matching relationships. Bone Marrow Transplant 2004;33:443.

[19] Ng J, Hurley CK, Carter C, Baxter-Lowe LA, Bing D, Chopek M, et al. Large-scale DRB and DQB1 oligonucleotide typing for the NMDP Registry: Progress report from year 2. Tissue Antigens 1996;47:21.

[20] Noreen H, Yu N, Setterholm M, Ohashi M, Baisch J, Endres R, et al. Validation of DNA-based HLA-A and HLA-B testing of volunteers for a bone marrow registry through parallel testing with serology. Tissue Antigens 2001;57:221.

[21] Hurley CK, Fernandez-Vina M, Setterholm M. Maximizing optimal hematopoietic stem cell donor selection from registries of unrelated adult volunteers. Tissue Antigens 2003;61:415.

[22] Maiers M, Hurley CK, Perlee L, Fernandez-Vina M, Baisch J, Cook D, et al. Maintaining updated DNA-based HLA assignments in the National Marrow Donor Program bone marrow registry. Rev Immunogenet 2001;2:449.

[23] Petersdorf EW, Kollman C, Hurley CK, Dupont B, Nademanee A, Begovich AB, et al. Effect of HLA class II gene disparity on clinical outcome in unrelated donor hematopoietic cell transplantation for CML: The U.S. National Marrow Donor Program experience. Blood 2001;98:2922.

[24] Flomenberg N, Baxter-Lowe LA, Confer D, Fernandez-Vina M, Filipovich A, Horowitz M, et al. Impact of HLA class I and class II high resolution matching on outcomes of unrelated donor bone marrow transplantation: HLA-C mismatching is associated with a strong adverse effect on transplant outcome. Blood 2004;104:1923.

[25] Excoffier L, Slatkin M. Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol 1995;12:921.

[26] Kalinowski ST, Hedrick PW. Estimation of linkage disequilibrium for loci with multiple alleles: Basic approach and an application using data from bighorn sheep. Heredity 2001;87:698.

[27] Stephens M, Smith NJ, Donnelly P. A new statistical method for haplotype reconstruction from population data. Am J Hum Genet 2001;68:978.

[28] Zhang S, Pakstis A, Kidd K, Zhao H. Comparisons of two methods for haplotype reconstruction and haplotype frequency estimation from population data [letter]. Am J Hum Genet 2001;69:906.

[29] Stephens M, Smith NJ, Donnelly P. Reply to Zhang et al. [letter]. Am J Hum Genet 2001;69:912.

[30] Beatty PG, Boucher KM, Mori M, Milford EL. Probability of finding HLA-mismatched related or unrelated marrow or cord blood donors. Hum Immunol 2000;61:834.

[31] Kollman C, Abella E, Baitty PG, Chakraborty R, Christiansen CL, Hartzman RJ, et al. Assessment of optimal size and composition of the U.S. National Registry of hematopoietic stem cell donors. Transplant 2004;78:89.

[32] Schipper RF, D'Amaro J, Oudshoorn M. The probability of finding a suitable related donor for bone marrow transplantation in extended families. Blood 1996;87:800.

[33] Petersdorf EW, Longton GM, Anasetti C, Mickelson EM, McKinney SK, Smith AG, et al. Association of HLA-C disparity with graft failure after marrow transplantation from unrelated donors. Blood 1997;89:1818.

[34] Morishima Y, Sasazuki T, Inoko H, Juji T, Akaza T, Yamamoto K, et al. The clinical significance of human leukocyte antigen (HLA) allele compatibility in patients receiving a marrow transplant from serologically HLA-A, HLA-B, and HLA-DR matched unrelated donors. Blood 2002;99:4200.

[35] Witzig R. The medicalization of race: Scientific legitimization of a flawed social construct. Ann Intern Med 1996;125:675.

[36] McKenzie KJ, Crowcroft NS. Race, ethnicity, culture and science. Br Med J 1994;309:286.

[37] Single RM, Meyer D, Hollenbach JA, Nelson MP, Noble JA, Erlich HA, Thomson G. Haplotype frequency estimation in patient populations: The effect of departures from Hardy-Weinberg proportions and collapsing over a locus in the HLA region. Genet Epidemiol 2002;22:186.
