
A High-Resolution Survey of Deletion Polymorphism in

the Human Genome–Supplementary Information

Donald F. Conrad∗, T. Daniel Andrews#, Nigel P. Carter#,

Matthew E. Hurles#, and Jonathan K. Pritchard∗∗

∗Department of Human Genetics

The University of Chicago. #Wellcome Trust Sanger Institute

Hinxton, Cambridge, UK.

October 5, 2005

This document and the accompanying spreadsheets include additional methods and data

from the paper by Conrad et al.

Additional documents

List of deletions: Included as Excel spreadsheets, Supplementary Table1.xls and Supplementary Table2.xls.

List of CGH validation results: Included as Excel spreadsheet Supplementary Table3.xls.

List of genes involved: Included as Excel spreadsheet Supplementary Table4.xls.

Example CGH plots: Included as pdf document Supplementary Note.pdf.

CGH validation experiments

∗Address for correspondence: Dept of Human Genetics, The University of Chicago, 920 E 58th St–CLSC

507, Chicago IL 60637, USA. Email: [email protected]; Fax: 773 834 0505; Phone: 773 834 5248.


Figure S.1: Plot of estimated false positive rate against threshold number of probes

within the minimal deleted interval. The noise apparent within log2 ratios for CGH

on oligonucleotide microarrays is such that a single probe provides insufficient information

to reliably detect the presence of a deletion. Therefore we sought to identify the minimum

number of probes needed to reliably infer the presence or absence of a deletion. We reasoned

that if this threshold number of probes was set too low then we should see a high rate of false

negative validations (true deletion classified as a false positive), which would lead to an over-

estimation of the false positive rate of candidate deletions identified from MIs. Consequently,

we decided to increase this threshold number from 1 to 70 and estimate the false positive rate

for each threshold to identify at what threshold value the estimated false positive rate among

113 candidate deletions (with at least 1 unique probe within the minimal deleted interval)

appears to asymptote onto the true false positive rate. We assessed whether a candidate

deletion was a false positive or not by assessing the p value of the Mann-Whitney U test for

each deletion (Methods) against a p value of 0.05 with (red) and without (blue) Bonferroni

correction. Power curves (black) are fitted to these plots of false positive rates. Note that

this analysis was performed prior to manual inspection of the CGH log2 ratio plots, and

so the absolute false positive rates are over-estimates due to the inclusion within the false

positives of true deletions shared with the reference individual (Methods). However, this

sharing of deletions between test and reference individuals is not related to the number of

probes within the minimal deleted interval and so it does not prevent the false positive rate

reaching an asymptote. From this analysis we chose a threshold

[Plot for Figure S.2: log2 probe ratio (y-axis, -1.5 to 1.5) against physical position in bp (x-axis, 570,000 to 730,000). Series: NA18500, NA18860, del1, del2, del3, and 30-period moving averages for NA18500 and NA18860.]

Figure S.2: Three deletion calls represent one deletion event. There are several in-

stances in the validation results where multiple, adjacent deletions predicted in the HapMap

are resolved as a single deletion event with CGH. Here, data from two array CGH exper-

iments are plotted, in one of which three deletions were called within the interval in the

test individual (grey/black). In the other experiment the test individual is undeleted for the

same deletion (blue). Values for individual probes are plotted as well as a moving average

to indicate the noise in the data. Black bars at log2 ratio of 1 indicate the position of the 3

minimally deleted intervals called within this region in the same individual. The physical

position for this locus, on 9pter, is indicated on the x-axis in units of base pairs.

The following figures are a representative sample of the most common outcomes we encoun-

tered during validation.

[Plot for Figure S.3: log2 probe ratio (y-axis, -1.5 to 1.5) against physical position (x-axis, 10.04 Mb to 10.13 Mb). Series: flank1, deletion min, flank2, and 30-period moving averages for NA18506, NA18860, NA18857, NA18515, NA18854, NA18521, NA18503 and NA18500.]

Figure S.3: A typical deletion in the test individual. The solid black line at y = 1

indicates the minimum possible length of the deletion based on HapMap data, while the

dashed black lines indicate flanking sequence predicted to not be deleted. Coloured lines

indicate moving average for 8 CGH experiments. The deletion segregating here was predicted

in only one of these 8 individuals, but was found in 3. Locus is on chromosome 1.

[Plot for Figure S.4: log2 probe ratio (y-axis, -1.5 to 1.5) against physical position (x-axis, 46.98 Mb to 47.05 Mb). Series: 30-period moving averages for NA18860, NA18857, NA18515, NA18854, NA18521, NA18503 and NA18500.]

Figure S.4: A typical false positive. Locus is on chromosome 12.

[Plot for Figure S.5: log2 probe ratio (y-axis, -1.5 to 1.5) against physical position. Series: flank1, deletion min, flank2, and 30-period moving averages for NA18506, NA18860, NA18857, NA18515, NA18854, NA18521, NA18503 and NA18500.]

Figure S.5: True positive masquerading as false positive. The log2 ratio shows no sign

of a deletion in the CGH experiment with the individual in which the deletion was called

(purple line). However, all other 7 CGH experiments exhibit a gain of log2 ratio in the same

region indicating that the reference individual most likely shares the same deletion as the

test individual in which the deletion was identified. Locus is on chromosome 2.

[Plot for Figure S.6: log2 probe ratio (y-axis, -1.5 to 2.5) against physical position. Series: 30-period moving averages for NA18860, NA18857, NA18515, NA18854, NA18521, NA18503 and NA18500.]

Figure S.6: Complexity, unsure of call. There is an apparent deletion segregating at the

predicted deletion locus, but a deletion is not apparent in CGH comparison of the individual

NA18500 with the reference (light green) in which it is called. However, the region of the

deletion is covered by segmental duplications, and so we may not expect the probes to

accurately report in a quantitative manner. Also note a very obvious CNP in the flanking

sequence, with both duplications and deletions in this set of 8 test individuals. Locus is on

chromosome 3.

[Plot for Figure S.7: log2 probe ratio (y-axis, -1.5 to 1.5) against physical position (x-axis, 11.90 Mb to 12.35 Mb). Series: flank1, deletion min, flank2, and 30-period moving averages for NA18506, NA18860, NA18857, NA18515, NA18854, NA18521, NA18503 and NA18500.]

Figure S.7: Complexity, unsure of call. Highly segmentally duplicated, gap in reference

sequence within tiled region. Clear CNP flanking the gap. Impossible to call whether deletion

exists or not. Locus is on 8p23.


Genomic analyses of deletions

In this section we describe various genome-level analyses that we performed.

Centromeres and telomeres. Due to the enrichment of segmental duplications near

pericentromeric and subtelomeric regions of the genome, a reasonable hypothesis is that

rearrangements may occur more frequently in these regions. We counted the number of

deletions that fell within 1Mb and 2Mb of centromeres and telomeres. The numbers for

each population were not significantly different from a model where deletions are placed at

random on the genome.

Distance      YRI deletions   CEU deletions   Genome sequence
Within 1 Mb   3.3%            1.1%            2.8%
Within 2 Mb   5.8%            5.8%            5.5%

X-chromosome deletions. At first blush, there appears to be a considerable deficit of

X-chromosome loci in our set of deletions. If the average rate at which deletions occur on

the autosomes is the same as the rate of deletion on the X, we would expect to find 8 X-

linked deletions in the YRI and 7 in the CEU, accounting for the reduced power to detect X

chromosome deletions. The observation of just 1 deletion on the non-pseudoautosomal X in

each population may reflect strong negative selection against deletions in males. However, a

careful analysis, which also considers differences in the effective population size between au-

tosomes and the X, as well as differences in the average recombination rate on the autosomes

and the X, will be required to draw substantive conclusions.
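As a rough illustration of how unusual one observed X-linked deletion is against an expectation of 8 (YRI) or 7 (CEU), the sketch below computes a Poisson lower-tail probability. The Poisson model is an assumption made here for illustration only; the text does not state which test, if any, was applied, and the calculation ignores the caveats about effective population size and recombination rate noted above.

```python
import math

def poisson_cdf(k: int, mean: float) -> float:
    """P(X <= k) for X ~ Poisson(mean)."""
    return sum(math.exp(-mean) * mean**x / math.factorial(x) for x in range(k + 1))

# Expected counts are taken from the text; one X-linked deletion
# was observed in each population.
p_yri = poisson_cdf(1, 8.0)
p_ceu = poisson_cdf(1, 7.0)
print(f"YRI: P(X <= 1 | mean 8) = {p_yri:.4f}")
print(f"CEU: P(X <= 1 | mean 7) = {p_ceu:.4f}")
```

Under this simplified model the probabilities are roughly 0.003 (YRI) and 0.007 (CEU), which is why the deficit looks considerable before the additional factors are taken into account.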

Transmission rates by sex of parent. We examined whether there was a sex-specific

bias in the origin of deletion transmissions in the HapMap data. There was a small but

non-significant excess of deletions transmitted from mothers in both populations (51.01%

CEU, 51.52% YRI).

Segmental duplications. It has previously been reported that segmental duplications

are hotbeds for copy number polymorphism (Lupski, 1998; Sharp et al., 2005; Tuzun et al.,

2005). In our samples we find that 16% of the deletions contain or overlap segmental du-

plication regions. A higher proportion of the deletions shared between the CEU and YRI

samples (14/37=38%) lay in segmental duplications. Overall, the overlap with segmental

duplications was much lower than the 55% of variants previously reported in such regions

(Tuzun et al., 2005). This difference may reflect an under-representation of segmental du-

plications by the HapMap. It also seems likely that the mutation mechanism for the smaller


deletions detected in this study will be different from the intermediate- and large-scale CNPs

reportedly affiliated with segmental duplications.


Distribution of functional categories for putative deleted SNPs

Category                CEU deletions   YRI deletions   Entire HapMap
Coding non-synonymous   28              66              10,449
Coding-synonymous       17              37              6,223
Intron                  760             1,423           327,541
Locus-region            183             261             29,976
Mrna-utr                148             239             54,534
NA                      3,154           4,360           577,183
Nonidentical            22              84              10,650
Splice-site             0               0               60
Total                   4,312           6,470           1,017,246

Table S.1: In order to assess the proportion of genes in deletions relative to the HapMap as a

whole, we identified every SNP that fell within a deletion-compatible region in CEU or YRI.

The rs#s for these SNPs were then used to obtain functional information from dbSNP (build

124). Functional information was also sought for every SNP in the Phase I HapMap. “Locus

region” indicates a SNP within 2 kb of a gene; NA, intergenic SNPs or SNPs for which

functional information was unavailable; nonidentical, SNPs with multiple but non-identical

annotations.
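To compare deletions with the HapMap as a whole, the counts in Table S.1 can be converted to proportions of each column total. The short sketch below does this; the counts and totals are copied from the table as printed, while the names `COUNTS`, `TOTALS` and `proportions` are illustrative.

```python
# Counts copied from Table S.1: (CEU deletions, YRI deletions, entire HapMap).
COUNTS = {
    "Coding non-synonymous": (28, 66, 10_449),
    "Coding-synonymous": (17, 37, 6_223),
    "Intron": (760, 1_423, 327_541),
    "Locus-region": (183, 261, 29_976),
    "Mrna-utr": (148, 239, 54_534),
    "NA": (3_154, 4_360, 577_183),
    "Nonidentical": (22, 84, 10_650),
    "Splice-site": (0, 0, 60),
}
# Column totals as printed in the "Total" row of Table S.1.
TOTALS = (4_312, 6_470, 1_017_246)

def proportions(counts, totals):
    """Return {category: (CEU, YRI, HapMap) fraction of each column total}."""
    return {cat: tuple(c / t for c, t in zip(row, totals))
            for cat, row in counts.items()}

PROPS = proportions(COUNTS, TOTALS)
for cat, (ceu, yri, hapmap) in PROPS.items():
    print(f"{cat:<22} CEU {ceu:7.2%}  YRI {yri:7.2%}  HapMap {hapmap:7.2%}")
```

For example, intronic SNPs make up about 17.6% of SNPs in CEU deletion-compatible regions and 22.0% in YRI, against 32.2% of the Phase I HapMap.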


Frequencies of Predicted Deletions

             CEU            YRI            Combined
Frequency    2MI+   1MI+    2MI+   1MI+    2MI+   1MI+
1 copy       167    107     280    137     412    228
2 copies     34     44      74     92      100    126
3 copies     11     23      22     59      30     68
4 copies     12     21      11     31      21     50
5 copies     0      11      6      27      12     30
> 5 copies   4      21      3      46      11     76
Total        228    227     396    392     586    578

Table S.2: The frequency spectrum of a set of polymorphic loci contains information about

the demographic and selective forces acting upon the genome around those loci. By making

the simplifying assumption that overlapping deletion-compatible regions in different individ-

uals represent alleles of the same deletion polymorphism, we constructed a crude frequency

spectrum for deletion loci detected in this study. In order to better approximate the true

frequency spectrum, we did a second analysis, this time including individuals with

deletion-compatible regions that contain only 1 MI as well as the deletion-compatible regions with

2 or more MIs. Occasionally, a 1-MI deletion will overlap two separate 2-MI+ deletion loci;

when this occurs, we collapse both 2-MI+ loci into a single locus. This results in a slight

reduction in the number of loci called in the 1-MI+ analysis. We present the results of both

analyses side-by-side for comparison. These data should be treated with caution since there

will be both false positives and false negatives. Combined: frequencies in the combined CEU

and YRI samples; 1MI+: analysis including 1-MI deletions; 2MI+: analysis excluding 1-MI

deletions.
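The collapsing rule described in this caption, where overlapping deletion-compatible regions are treated as alleles of one locus, amounts to merging overlapping intervals and counting the regions that fall in each merged locus. A minimal sketch follows. The coordinates and the function name `collapse_regions` are hypothetical, and real data would also need to track chromosome and individual identity.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # (start, end) in base pairs on one chromosome

def collapse_regions(regions: List[Region]) -> List[Tuple[int, int, int]]:
    """Merge overlapping regions; return (start, end, copies) per locus."""
    loci: List[List[int]] = []
    for start, end in sorted(regions):
        if loci and start <= loci[-1][1]:
            # Overlaps the current locus: extend it and count one more copy.
            loci[-1][1] = max(loci[-1][1], end)
            loci[-1][2] += 1
        else:
            loci.append([start, end, 1])
    return [(s, e, n) for s, e, n in loci]

# Hypothetical regions; the third bridges what would otherwise be two loci,
# mirroring a 1-MI region that overlaps two separate 2-MI+ loci.
example = [(100, 200), (300, 400), (150, 350), (900, 950)]
LOCI = collapse_regions(example)
print(LOCI)
```

In this toy input the bridging region joins the first two regions into a single locus with three copies, which is the mechanism behind the slightly smaller number of loci in the 1-MI+ columns.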

[Plot for Figure S.8: chromosome ideograms labelled 1 to 22 and X.]

Figure S.8: Genome map of observed deletions. This cartoon depicts the physical

location of each region of the genome found to be hemizygous in at least one individual.

Blue lines indicate the location of a region found in CEU, red lines correspond to YRI. The

centromeres are indicated by purple bars.

[Plot for Figure S.9: bar chart of % deletions - % genome (y-axis, -15.0 to 15.0) by molecular function category. Categories: Molecular function unclassified; Nucleic acid binding; Transcription factor; Transporter; Hydrolase; Signaling molecule; Oxidoreductase; Membrane traffic protein; Chaperone; Miscellaneous function; Isomerase; Lyase; Extracellular matrix; Ligase; Select calcium binding protein; Synthase and synthetase; Cytoskeletal protein; Transfer/carrier protein; Ion channel; Cell junction protein; Transferase; Phosphatase; Select regulatory molecule; Cell adhesion molecule; Kinase; Protease; Defense/immunity protein; Receptor.]

Figure S.9: Analysis of deletion gene molecular function. We identified 267

genes in RefSeq which either overlapped or encompassed a predicted deletion. To in-

vestigate whether these genes were a random sample of genes from the genome, we

used the GeneID associated with each accession to query the PANTHER database at

http://panther.appliedbiosystems.com. These 267 genes fell into 28 molecular function cat-

egories at the highest level of ontological classification. For each category, we calculated the

proportion of all deletion genes that fall within that category, as well as the proportion of

all genes in the genome that fall within that category. The chart displays the difference in

percentage between the two proportions.

[Plot for Figure S.10: bar chart of % deletions - % genome (y-axis, -15 to 20) by biological process category. Categories: Biological process unclassified; Nucleoside, nucleotide and nucleic acid metabolism; Transport; Cell proliferation and differentiation; Cell cycle; Apoptosis; Amino acid metabolism; Developmental processes; Electron transport; Cell structure and motility; Intracellular protein traffic; Other metabolism; Blood circulation and gas exchange; Lipid, fatty acid and steroid metabolism; Oncogenesis; Coenzyme and prosthetic group metabolism; Protein targeting and localization; Carbohydrate metabolism; Muscle contraction; Protein metabolism and modification; Neuronal activities; Cell adhesion; Immunity and defense; Sensory perception; Signal transduction.]

Figure S.10: Analysis of deletion gene biological processes. We identified 267

genes in RefSeq which either overlapped or encompassed a predicted deletion. To in-

vestigate whether these genes were a random sample of genes from the genome, we

used the GeneID associated with each accession to query the PANTHER database at

http://panther.appliedbiosystems.com. These 267 genes fell into 25 biological process cate-

gories at the highest level of ontological classification. For each category, we calculated the

proportion of all deletion genes that fall within that category, as well as the proportion of

all genes in the genome that fall within that category. The chart displays the difference in

percentage between the two proportions.


Validation using Real-Time PCR.

Twelve predicted deletions at 10 loci were selected for validation using real-time PCR. We de-

signed a total of 18 primer pairs that overlapped or flanked SNPs within 10 putative deletion

loci, using Primer Express v2.0 software (ABI). The chosen SNPs represented a mix of both

Mendelian-inconsistent and uninformative SNPs. The primer sequences for our endogenous

control, exon 2 of von Hippel-Lindau 1 (vHL1), were obtained from previously published

work (Hoebeeck et al., 2005). Primers were designed to produce amplicons of approximately

100bp. Genomic DNA for seven CEPH individuals (NA10857, NA12043, NA12044, NA12264,

NA12234, NA10863, NA10851) was obtained from Coriell Cell Repositories (Camden, NJ).

Real-time PCR reactions were run in 50µL volumes, using the Platinum SYBR Green qPCR

Supermix reagent cocktail (Invitrogen) and a dNTP mix (Promega), on an ABI Prism

7900HT (PE Applied-Biosystems). The PCR experiments used an initial ramp-up of 50 °C

for 2 minutes, followed by 40 cycles of (95 °C for 30 seconds; 57 °C for 30 seconds). The reaction

was terminated by a final heat dissociation cycle of 95 °C for 15 seconds, 57 °C for 15 seconds,

95 °C for 15 seconds, to verify the presence of a single PCR product.

For each primer set, each sample was run in duplicate or triplicate for at least 4 different

concentrations of genomic DNA, within the three members of a trio believed to carry a

particular deletion. We used a relative standard curve method to assess each individual’s

genomic copy-number for a given target region, following the manufacturer’s instructions

(Applied Biosystems).
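The relative standard curve calculation can be sketched as follows: a dilution series gives a linear relation between Ct and log10 input quantity for each amplicon, unknown samples are read off that line, and the target quantity is divided by the endogenous control quantity. Every number below is hypothetical, and the helper names `fit_line` and `quantity_from_ct` are illustrative rather than part of any instrument software.

```python
import math
from typing import List, Tuple

def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

def quantity_from_ct(ct: float, intercept: float, slope: float) -> float:
    """Invert the standard curve Ct = intercept + slope * log10(quantity)."""
    return 10 ** ((ct - intercept) / slope)

# Hypothetical dilution series (ng genomic DNA) and Ct values for a non-carrier.
dilutions = [50.0, 25.0, 12.5, 6.25]
log_q = [math.log10(q) for q in dilutions]
target_curve = fit_line(log_q, [24.0, 25.0, 26.0, 27.0])
control_curve = fit_line(log_q, [23.0, 24.0, 25.0, 26.0])

# Hypothetical Ct values for a putative deletion carrier at one input amount.
target_q = quantity_from_ct(26.0, *target_curve)
control_q = quantity_from_ct(24.0, *control_curve)
relative_copy_number = target_q / control_q
print(f"relative copy number = {relative_copy_number:.2f}")
```

With these toy values the target amplifies one cycle later than expected relative to the control, giving a ratio of one half, the value expected for a hemizygous region.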

Results are shown in Figure S.11. All ten candidate deletion regions show patterns consistent

with deletion transmission from parent to child.


Validation regions examined by real-time PCR

Chromosome   Start         Length    SNPs   MIs   PCRs   Region
1            146,393,913   92,766    3      2     1      7
1            146,510,265   10,492    3      1     1      8
3            150,282,950   11,424    2      2     2      1
4            169,491,071   203,747   56     3     3      9
6            10,574,024    46,284    17     4     3      2
8            14,841,542    230,079   27     10    1      3
8            15,087,195    130,346   19     12    2      4
8            15,271,799    413,183   56     9     1      5
8            104,088,748   1,061     2      2     1      6
20           14,798,454    18,773    6      4     3      10

Table S.3: Summary of 10 genomic regions used in validation. Start, beginning of deletion-

compatible region in base pairs (NCBI 34); Length, length of deletion-compatible region;

SNPs, number of SNPs within region; MIs, number of Mendelian incompatibilities detected

within region; PCRs, number of non-overlapping primer sets tested within region; Region,

number used to label PCRs in Figure S.11. The region with just one MI was part of a

multiple-MI region in an earlier HapMap build, when the experiment was designed. The

regions described here were selected on the basis of an early build of the HapMap (August

2004), so start and stop positions may not align precisely to a deletion identified in the final

analysis.

[Plot for Figure S.11: relative DNA quantity for each PCR, with points labelled by region number (1 to 10).]

Figure S.11: Validation of predicted deletions. Eighteen regions from twelve predicted dele-

tions were examined using quantitative PCR. The relative quantity of DNA in the predicted

deletion carriers (red points) was consistently lower than in the non-carrier parent for each

trio (blue points). The quantitative values for each region are normalized by setting the

blue points to 1. The dotted line indicates the 50% level predicted for hemizygous regions.

The numbers floating above each point can be used to match an experimental result to its

genomic region in Table S.3.


Deletion filtering to remove false positives.

As described in the Methods, we performed filtering to remove candidate deletions that may

not be transmitted deletions. That is, we discarded deletion regions that (i) contained one

or more Type II MIs, as these can indicate a somatic deletion in a parent; (ii) occurred

in unusual clusters of deletions in a trio that was an outlier in terms of total number of

deletions; (iii) occurred in known regions of somatic recombination.

In order to illustrate the nature of these changes, we provide the following summary of

candidate deletion regions removed. (i). One CEU trio, part of family 1459, was an extreme

outlier, with 91 deletion regions, compared to the CEU average of 15. The cell line for

individual CEPH1459.11, the father of this trio, has apparently ejected a substantial portion

of chromosome 1q (approximately 104 Mb), creating a large region with high rates of both

Type I and Type II MIs (such a pattern is expected for a chromosome that is transmitted

to the child, and then subsequently lost in the parent). The breakpoint lies within 1q21,

which is a noted hotspot for chromosome breaks in cells infected by human cytomegalovirus

(a close relative to Epstein-Barr virus) (Fortunato et al., 2000) and the break may therefore

have occurred during the transformation process. Not including the chromosome 1q ejection

or deletion regions removed due to other considerations, there were 8 deletions in YRI and 3

deletions in CEU that were compatible with cell-line specific deletion. (ii). We identified 3

YRI trios who appeared as outliers in terms of either their total number of deletions or the

number of deletions manifest on a single chromosome. Trios 17 and 45 both had abnormally

large numbers of MIs and deletion regions on chromosome 12. We were able to confirm

with the genotyping center responsible for the aberrant genotypes that sample mix-ups

had occurred with both trios (Martin Moorhead, personal communication). We removed all

candidate deletion regions from these two trios on chromosome 12: 22 regions from trio 17

and 20 regions from trio 45. YRI trio 23 was an outlier on chromosome 10, with 32 deletion

regions between 13.4 Mb and 26.95 Mb, a density comparable to the chromosome 1q ejection.

The MIs producing these deletions are all of category “A” (see Methods), with relatively

few Type II MIs (diagnostic of a cell-line deletion). There was not a noticeable elevation in

homozygosity for any member of the trio in this region (data not shown). The responsible

genotyping center has identified the single experiment in which aberrant genotypes in this

region of chromosome 10 were generated for the mother of this trio (NA18855), and has

requested that these data be removed from future HapMap releases (Sarah Hunt, personal

communication). (iii) Deletion regions recorded in the following regions of known somatic

recombination were discarded:

Region          Chromosome   Approx. start (Mb, NCBI 34)   Approx. stop (Mb, NCBI 34)
IGK             2            89.00                         89.50
TCRG            7            37.92                         38.10
TCRB            7            141.60                        141.96
TCRA and TCRD   14           20.55                         21.01
IGH             14           104.00                        105.00
IGL             21           20.70                         22.00

In total, 26 CEU and 6 YRI deletions were discarded that fell in class (iii) but no other class.
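Filter (iii) can be sketched as an interval-overlap check against the table of somatic recombination regions. The region coordinates below are copied from that table; the candidate deletions are hypothetical, and treating any overlap (rather than full containment) as grounds for removal is an assumption made for this sketch.

```python
# Approximate NCBI 34 coordinates (Mb) copied from the table above.
SOMATIC_REGIONS = {
    "IGK": ("2", 89.00, 89.50),
    "TCRG": ("7", 37.92, 38.10),
    "TCRB": ("7", 141.60, 141.96),
    "TCRA and TCRD": ("14", 20.55, 21.01),
    "IGH": ("14", 104.00, 105.00),
    "IGL": ("21", 20.70, 22.00),
}

def in_somatic_region(chrom: str, start_mb: float, stop_mb: float) -> bool:
    """True if [start_mb, stop_mb] on chrom overlaps any listed region."""
    return any(
        chrom == region_chrom
        and start_mb <= region_stop
        and stop_mb >= region_start
        for region_chrom, region_start, region_stop in SOMATIC_REGIONS.values()
    )

# Hypothetical candidate deletions: (chromosome, start Mb, stop Mb).
candidates = [("2", 89.20, 89.25), ("7", 50.00, 50.10), ("14", 103.90, 104.05)]
KEPT = [c for c in candidates if not in_somatic_region(*c)]
print(KEPT)
```

Only the chromosome 7 candidate survives here, since the other two overlap IGK and IGH respectively.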

MCMC implementation.

In this section, we describe the MCMC algorithm that was used to generate some of the

statistical results in the main text. The primary quantities of interest are the total number

of haploid deletions transmitted to the HapMap offspring (observed and unobserved), and

the true lengths of the deletions that were observed. The data used in this method consist

of the following:

N = {n1, ..., n23}, where n23 is the number of deletions detected on the X chromosome and ni is the number of deletions detected on chromosome i for all other i.

M = {m1, ..., mJ}, where mj = (minj, maxj) gives the smallest and largest possible size for observed deletion j, and $J = \sum_{i=1}^{23} n_i$. The physical distance between the first deletion-incompatible SNPs distal and proximal to the deletion defines maxj, and the distance between the most distal and proximal MI within the deletion-compatible region defines minj.
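The two bounds can be computed directly from SNP positions, as in the sketch below. The positions are hypothetical, the function name `deletion_bounds` is illustrative, and the sketch assumes that a deletion-incompatible SNP exists on both sides of the region.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

def deletion_bounds(mi_positions: List[int],
                    incompatible_positions: List[int]) -> Tuple[int, int]:
    """Return (min_len, max_len) in bp for one deletion-compatible region.

    mi_positions: positions of the MIs inside the region.
    incompatible_positions: sorted positions of deletion-incompatible SNPs
    on the same chromosome.
    """
    first_mi, last_mi = min(mi_positions), max(mi_positions)
    i = bisect_left(incompatible_positions, first_mi)
    j = bisect_right(incompatible_positions, last_mi)
    if i == 0 or j == len(incompatible_positions):
        raise ValueError("no flanking deletion-incompatible SNP on one side")
    min_len = last_mi - first_mi
    max_len = incompatible_positions[j] - incompatible_positions[i - 1]
    return min_len, max_len

# Hypothetical positions (bp).
BOUNDS = deletion_bounds([1200, 1500, 2100], [400, 900, 2600, 3000])
print(BOUNDS)
```

Here the outermost MIs are 900 bp apart, and the nearest flanking deletion-incompatible SNPs (at 900 and 2600) are 1,700 bp apart.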

Additional information used in the analyses are

B = {b1, ..., b23} the sizes (bp) of the chromosomes and pi(l), the power to detect a deletion

of size l on chromosome i. This latter function was estimated by simulation as described in

the Methods section of the main text.

Model. We created a formal hierarchical model to relate the number and length of observed

deletions to the total number and length distribution of deletions in both the CEU and YRI

samples. In this model the true length of the jth detected deletion is lj, and L = {l1, ..., lJ}.

Then the frequency of a deletion of length l, denoted f(l), follows a gamma distribution with

shape parameter α and scale parameter β. The total number of deletions in the sample on

the ith chromosome is ti, T = {t1, ..., t23}. The ti are Poisson distributed with rate 30cbi, independent for all i. The number of deletions observed in the HapMap on chromosome i, ni, is distributed

$$n_i \sim \mathrm{Bin}(\lambda_i, t_i)$$

where

$$\lambda_i = \int_0^{\infty} p_i(l)\, \mathrm{Ga}(l; \alpha, \beta)\, dl$$

which is the power to detect a deletion on chromosome i, unconditional on its length. We use

the Gauss-Kronrod 21-point integration rule to numerically evaluate λi (Press et al., 1992).

We use flat priors on the shape parameter, α, the scale parameter, β, and on c, the total number of deletions in the sample. The prior distribution of deletion sizes (i.e., unconditional on detection) will be Ga(α, β).
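The integral defining λi can be approximated numerically, as sketched below. Note the substitution: this sketch uses composite Simpson's rule rather than the Gauss-Kronrod 21-point rule named above, purely to stay within the Python standard library. The power curve `toy_power` is hypothetical (the real pi(l) was estimated by simulation), and the truncation point `upper` is an assumption.

```python
import math
from typing import Callable

def gamma_pdf(l: float, alpha: float, beta: float) -> float:
    """Gamma density with shape alpha and scale beta."""
    if l <= 0.0:
        return 0.0
    log_pdf = ((alpha - 1.0) * math.log(l) - l / beta
               - math.lgamma(alpha) - alpha * math.log(beta))
    return math.exp(log_pdf)

def detection_probability(power: Callable[[float], float], alpha: float,
                          beta: float, upper: float, n: int = 20_000) -> float:
    """Simpson approximation of the integral of power(l) * Ga(l) on [0, upper]."""
    if n % 2:
        n += 1  # Simpson's rule needs an even number of intervals
    h = upper / n
    f = lambda l: power(l) * gamma_pdf(l, alpha, beta)
    total = f(0.0) + f(upper)
    total += 4.0 * sum(f((2 * k - 1) * h) for k in range(1, n // 2 + 1))
    total += 2.0 * sum(f(2 * k * h) for k in range(1, n // 2))
    return total * h / 3.0

# Hypothetical power curve rising towards 1 with deletion length (bp).
toy_power = lambda l: 1.0 - math.exp(-l / 2.0e4)
LAMBDA = detection_probability(toy_power, alpha=2.0, beta=2.0e4, upper=1.2e6)
print(f"lambda = {LAMBDA:.4f}")
```

For this particular toy curve the integral has the closed form 1 - (1 + β/20000)^(-α) = 0.75, which gives a convenient check on the numerical result.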

We now wish to sample from Pr(c, T, L, α, β|N, M). This is done using the following

Metropolis-Hastings algorithm.

Initialize all parameters with plausible starting values α(0), β(0), c(0), T(0), L(0). Then update

in turn as follows:

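Step 1. Propose c′ ∼ N(c, σ1²). If c′ ∈ (0, max) then accept c′ if a uniform random variable u satisfies

$$u < \prod_{i=1}^{23} \frac{\mathrm{Po}(t_i;\, 30c'b_i)}{\mathrm{Po}(t_i;\, 30cb_i)} = \prod_{i=1}^{23} \left(\frac{c'}{c}\right)^{t_i} \exp\bigl(30\, b_i\, (c - c')\bigr)$$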
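Step 1 can be sketched in code by working with the logarithm of the acceptance ratio, which avoids overflow when the ti are large. The function names, the proposal standard deviation `sigma` and the upper bound `c_max` are illustrative; this is a sketch of a random-walk Metropolis update under the Poisson model stated above, not the authors' implementation.

```python
import math
import random
from typing import List

def log_ratio_c(c_new: float, c_old: float, t: List[int], b: List[float]) -> float:
    """Log of prod_i Po(t_i; 30 c' b_i) / Po(t_i; 30 c b_i)."""
    return sum(ti * math.log(c_new / c_old) + 30.0 * bi * (c_old - c_new)
               for ti, bi in zip(t, b))

def update_c(c: float, t: List[int], b: List[float], sigma: float,
             c_max: float, rng: random.Random) -> float:
    """One Metropolis update of c with a normal random-walk proposal."""
    c_new = rng.gauss(c, sigma)
    if not 0.0 < c_new < c_max:
        return c  # proposals outside the support of the flat prior are rejected
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is defined
    return c_new if math.log(u) < log_ratio_c(c_new, c, t, b) else c
```

A call such as `update_c(1e-9, [3, 5], [1.0e8, 2.0e8], 2e-10, 1e-7, random.Random(0))` returns either the proposed value or the current one; the inputs in that call are hypothetical.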

Step 2. For each ti, propose t′i from a symmetric discrete distribution centered on ti. If t′i ≥ ni then accept t′i if

$$u < \frac{\mathrm{Po}(t'_i;\, 30cb_i)}{\mathrm{Po}(t_i;\, 30cb_i)} \cdot \frac{\mathrm{Bin}(n_i;\, \lambda_i, t'_i)}{\mathrm{Bin}(n_i;\, \lambda_i, t_i)} = (30cb_i)^{(t'_i - t_i)}\, \frac{(t_i - n_i)!}{(t'_i - n_i)!}\, (1 - \lambda_i)^{(t'_i - t_i)}$$
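A sketch of Step 2 follows, again in log space. The acceptance ratio is the product of a Poisson ratio and a binomial ratio, in which the factorials of ti and t′i cancel. The ±1 proposal is one possible symmetric discrete distribution and is an assumption of this sketch; the text does not specify which distribution was used.

```python
import math
import random

def log_ratio_t(t_new: int, t_old: int, n_obs: int, rate: float, lam: float) -> float:
    """Log of Po(t'; rate) Bin(n; lam, t') / (Po(t; rate) Bin(n; lam, t)).

    rate stands for 30 * c * b_i and lam for the detection probability.
    """
    return ((t_new - t_old) * (math.log(rate) + math.log1p(-lam))
            + math.lgamma(t_old - n_obs + 1) - math.lgamma(t_new - n_obs + 1))

def update_t(t: int, n_obs: int, rate: float, lam: float, rng: random.Random) -> int:
    """One Metropolis update of t_i using a symmetric +/-1 proposal."""
    t_new = t + rng.choice((-1, 1))
    if t_new < n_obs:
        return t  # the total cannot be smaller than the number observed
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is defined
    return t_new if math.log(u) < log_ratio_t(t_new, t, n_obs, rate, lam) else t
```

The constraint `t_new >= n_obs` reflects the fact that the total number of deletions on a chromosome cannot be smaller than the number actually detected there.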

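Step 3. Update each lj for the detected deletions, in turn, as follows. Simulate l′j ∼ Uni(minj, maxj). Accept l′j if

$$u < \frac{\mathrm{Ga}(l'_j;\, \alpha, \beta)\, p_i(l'_j)}{\mathrm{Ga}(l_j;\, \alpha, \beta)\, p_i(l_j)} = \left(\frac{l'_j}{l_j}\right)^{\alpha - 1} \exp\!\left(\frac{l_j - l'_j}{\beta}\right) \frac{p_i(l'_j)}{p_i(l_j)}$$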
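Step 3 can be sketched as an independence sampler with a uniform proposal on the interval between the minimum and maximum possible lengths. The gamma normalising constants cancel in the ratio, leaving the power-law term, the exponential term and the ratio of detection powers. The `power` argument is a hypothetical stand-in for the simulated power function pi(l), and the sketch assumes positive lengths with non-zero power.

```python
import math
import random
from typing import Callable

def log_ratio_length(l_new: float, l_old: float, alpha: float, beta: float,
                     power: Callable[[float], float]) -> float:
    """Log of Ga(l'; alpha, beta) p(l') / (Ga(l; alpha, beta) p(l))."""
    return ((alpha - 1.0) * math.log(l_new / l_old) + (l_old - l_new) / beta
            + math.log(power(l_new)) - math.log(power(l_old)))

def update_length(l: float, l_min: float, l_max: float, alpha: float, beta: float,
                  power: Callable[[float], float], rng: random.Random) -> float:
    """One update of l_j with a Uniform(min_j, max_j) proposal."""
    l_new = rng.uniform(l_min, l_max)
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is defined
    return l_new if math.log(u) < log_ratio_length(l_new, l, alpha, beta, power) else l
```

Because the proposal is uniform and independent of the current value, the proposal densities cancel and no Hastings correction is needed.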

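Step 4. Propose α′ ∼ N(α, σ2²). If α′ ∈ (0, maxα) then accept α′ if

$$u < \prod_{j=1}^{J} \frac{\Pr(l_j \mid \alpha', \beta, \text{detected})}{\Pr(l_j \mid \alpha, \beta, \text{detected})}$$

where

$$\Pr(l_j \mid \alpha, \beta, \text{detected}) = \frac{1}{\Gamma(\alpha)\, \beta^{\alpha}}\; l_j^{\alpha - 1}\, e^{-l_j/\beta}\; \frac{p_i(l_j)}{\lambda_i(\alpha, \beta)}$$

and i denotes the chromosome carrying deletion j.

Step 5. Propose β′ ∼ N(β, σ3²). If β′ ∈ (0, maxβ) then accept β′ if

$$u < \prod_{j=1}^{J} \frac{\Pr(l_j \mid \alpha, \beta', \text{detected})}{\Pr(l_j \mid \alpha, \beta, \text{detected})}$$

where Pr(lj | α, β, detected) is defined as in Step 4.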

We mix over possible deletion lengths by choosing lj uniformly at random over the acceptable

interval (from the minimum possible size of the deletion to the maximum possible size).

In practice, we collect 10,000 samples of α(m), β(m), c(m), T (m), L(m) from the Markov chain

once it has reached stationarity.

The method was tested using simulated deletions from the HapMap data, and was found to

perform well at estimating c and L for a wide range of α and β (Figure S.12). As α approaches

0, however, the estimates of c and L become increasingly biased towards smaller values; we

therefore suspect that the values of L and c reported in this paper are underestimates.

[Plot for Figure S.12: estimated length (kb; y-axis, 0 to 800) against true length (kb; x-axis, 0 to 200).]

Figure S.12: Estimation accuracy of deletion lengths using our MCMC algorithm. Results

of analysis of simulated dataset based on CEU chromosome 2. The true length of each

simulated deletion is plotted against an estimation of the length by (a), using the maximum

possible length, red points; (b) the minimum possible length, which is the distance between

the most distal MI and most proximal MI within the deletion-compatible region, blue points;

(c) length estimated using MCMC, black points. Points from a completely accurate method

would fall along the black line. Three thousand deletions were simulated with sizes drawn

from a Gamma distribution with parameters α = 2.0 and β = 2 × 10^4.


Estimated Rates of CEU Trio Configurations in Phase I HapMap

Center  Platform  A  B  C  D  E  F  G  Total
perlegen  perlegen  0.00018(±5.1 × 10^-5)  0.00017(±4.9 × 10^-5)  5.6 × 10^-5(±2.8 × 10^-5)  0.67(±0.0031)  0.055(±0.00088)  0.054(±0.00087)  0.22(±0.0018)  70980
broad  illumina  0.00026(±1.2 × 10^-5)  0.00028(±1.2 × 10^-5)  0.00045(±1.5 × 10^-5)  0.61(±0.00056)  0.066(±0.00019)  0.066(±0.00019)  0.26(±0.00037)  1927348
broad  sequenom/affy  0.00042(±1.2 × 10^-5)  0.00039(±1.1 × 10^-5)  0.00019(±7.8 × 10^-6)  0.5(±0.0004)  0.086(±0.00017)  0.085(±0.00017)  0.33(±0.00032)  3090378
chmc  illumina  0.00032(±1.3 × 10^-5)  0.00032(±1.3 × 10^-5)  0.00022(±1.1 × 10^-5)  0.52(±0.00052)  0.081(±0.00021)  0.08(±0.0002)  0.32(±0.00041)  1921888
chmc  sequenom  0.00063(±2.7 × 10^-5)  0.00057(±2.5 × 10^-5)  0.00036(±2 × 10^-5)  0.52(±0.00077)  0.086(±0.00031)  0.085(±0.00031)  0.31(±0.00059)  884338
illumina  illumina  9.1 × 10^-5(±4 × 10^-6)  9.5 × 10^-5(±4.1 × 10^-6)  7.1 × 10^-5(±3.5 × 10^-6)  0.47(±0.00029)  0.087(±0.00012)  0.086(±0.00012)  0.35(±0.00025)  5805924
imsut-riken  invader  1.4 × 10^-5(±1.5 × 10^-6)  1.4 × 10^-5(±1.5 × 10^-6)  0(±0)  0.56(±0.0003)  0.074(±0.00011)  0.076(±0.00011)  0.29(±0.00022)  6361818
mcgill-gqic  illumina  6.3 × 10^-5(±4.6 × 10^-6)  5.2 × 10^-5(±4.2 × 10^-6)  3.1 × 10^-5(±3.3 × 10^-6)  0.54(±0.00043)  0.078(±0.00016)  0.077(±0.00016)  0.31(±0.00032)  2906610
bcm  mip  0.00053(±1.9 × 10^-5)  0.00059(±2 × 10^-5)  0.00089(±2.5 × 10^-5)  0.54(±0.00062)  0.077(±0.00023)  0.078(±0.00023)  0.3(±0.00046)  1411300
sanger  illumina  0.00054(±9 × 10^-6)  0.00058(±9.3 × 10^-6)  0.0014(±1.4 × 10^-5)  0.55(±0.00029)  0.075(±0.00011)  0.076(±0.00011)  0.3(±0.00021)  6671012
ucsf-wu  PE FP  0.00031(±2.9 × 10^-5)  0.00029(±2.8 × 10^-5)  0.00021(±2.4 × 10^-5)  0.49(±0.0012)  0.088(±0.0005)  0.086(±0.00049)  0.34(±0.00097)  359212

Table S.4: We present here the rate matrix of autosomal CEU trio configurations used in our simulations. For each combination of

genotyping center and genotyping platform, we have estimated the frequency of each configuration (i.e. A-G; see Methods) considered

by our deletion-detection algorithm. The standard error follows in parentheses. Each center-platform combination has been considered

separately to accommodate inter-platform and inter-operator variation in these rates.
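The caption does not say how the standard errors were computed. The tabulated values appear close to sqrt(rate / total), the approximation obtained by treating each configuration count as Poisson, whereas the binomial form sqrt(p(1 - p) / n) gives noticeably smaller values for common configurations such as D. The sketch below checks this reading against the first row of the table; it is an inference from the printed numbers, not a statement of the authors' method.

```python
import math

def rate_standard_error(rate: float, total: int) -> float:
    """Poisson-style standard error sqrt(rate / total) of an estimated rate."""
    return math.sqrt(rate / total)

def binomial_standard_error(rate: float, total: int) -> float:
    """Binomial standard error sqrt(p (1 - p) / n), shown for comparison."""
    return math.sqrt(rate * (1.0 - rate) / total)

# Rates and total copied from the "perlegen" row of Table S.4.
SE_A = rate_standard_error(0.00018, 70_980)        # tabulated as 5.1 x 10^-5
SE_D = rate_standard_error(0.67, 70_980)           # tabulated as 0.0031
SE_D_BINOMIAL = binomial_standard_error(0.67, 70_980)
print(f"A: {SE_A:.2g}  D: {SE_D:.2g}  D (binomial): {SE_D_BINOMIAL:.2g}")
```

Small differences from the printed values are expected because the rates in the table are themselves rounded.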


Estimated Rates of YRI Trio Configurations in Phase I HapMap

Center  Platform  A  B  C  D  E  F  G  Total
broad  illumina  0.00041(±1.5 × 10^-5)  0.0004(±1.5 × 10^-5)  0.00049(±1.6 × 10^-5)  0.55(±0.00054)  0.076(±0.0002)  0.078(±0.0002)  0.29(±0.00039)  1910458
broad  sequenom/affy  0.0016(±2.3 × 10^-5)  0.0015(±2.3 × 10^-5)  0.00039(±1.2 × 10^-5)  0.51(±0.00042)  0.087(±0.00017)  0.087(±0.00017)  0.31(±0.00032)  2917710
chmc  illumina  0.00042(±1.4 × 10^-5)  0.00043(±1.4 × 10^-5)  0.00015(±8.2 × 10^-6)  0.48(±0.00047)  0.088(±0.0002)  0.088(±0.0002)  0.34(±0.0004)  2195580
chmc  sequenom  0.00093(±4.3 × 10^-5)  0.00097(±4.4 × 10^-5)  0.00034(±2.6 × 10^-5)  0.51(±0.001)  0.088(±0.00042)  0.085(±0.00041)  0.31(±0.00079)  502676
illumina  illumina  6.6 × 10^-5(±3.4 × 10^-6)  5.6 × 10^-5(±3.1 × 10^-6)  2.7 × 10^-5(±2.2 × 10^-6)  0.48(±0.00029)  0.089(±0.00012)  0.089(±0.00012)  0.35(±0.00024)  5809386
imsut-riken  invader  0.00012(±4.4 × 10^-6)  0.00012(±4.5 × 10^-6)  6.8 × 10^-7(±3.4 × 10^-7)  0.53(±0.0003)  0.082(±0.00012)  0.081(±0.00012)  0.31(±0.00023)  5923486
mcgill-gqic  illumina  0.00024(±9.2 × 10^-6)  0.00024(±9.1 × 10^-6)  3.5 × 10^-5(±3.5 × 10^-6)  0.5(±0.00042)  0.084(±0.00017)  0.085(±0.00017)  0.33(±0.00034)  2846328
bcm  mip  0.0013(±3.1 × 10^-5)  0.0014(±3.2 × 10^-5)  0.0014(±3.1 × 10^-5)  0.52(±0.00062)  0.081(±0.00024)  0.082(±0.00024)  0.31(±0.00048)  1367738
sanger  illumina  0.00092(±1.1 × 10^-5)  0.00083(±1.1 × 10^-5)  0.0017(±1.6 × 10^-5)  0.52(±0.00027)  0.081(±0.00011)  0.081(±0.00011)  0.31(±0.00021)  6974062
ucsf-wu  PE FP  0.00053(±4.1 × 10^-5)  0.00043(±3.7 × 10^-5)  0.00046(±3.8 × 10^-5)  0.5(±0.0013)  0.09(±0.00054)  0.089(±0.00053)  0.32(±0.001)  314220

Table S.5: We present here the rate matrix of autosomal YRI trio configurations used in our simulations. For each combination of

genotyping center and genotyping platform, we have estimated the frequency of each configuration (i.e. A-G; see Methods) considered

by our deletion-detection algorithm. The standard error for each rate follows in parentheses. Each center-platform combination has

been considered separately to accommodate inter-platform and inter-operator variation in these rates.


Enumeration of Trio Configurations

Parent 1 Parent 2 Child Class

AA AA AA D

AA AA AB G

AA AA BB C

AA AA NN D

AA AB AB G

AA AB AA E/F

AA AB BB A/B

AA AB NN D

AA NN AA D

AA NN AB G

AA NN BB A/B

AA NN NN D

NN NN AA D

NN NN AB G

NN NN BB D

NN NN NN D

BB BB AA C

BB BB AB G

BB BB BB D

BB BB NN D

BB AA AA A/B

BB AA AB G

BB AA BB A/B

BB AA NN D

BB AB AA A/B

BB AB AB G

BB AB BB E/F

BB AB NN D

BB NN AA C

BB NN AB G

BB NN BB D

BB NN NN D

AB AB AB G

AB AB AA G

AB AB BB G

AB AB NN G

AB NN AB G

AB NN AA E/F

AB NN BB E/F

AB NN NN D

Table S.6: Here we present in full the coding scheme used in our deletion-detection algorithm

(See Methods). There are 4^3 = 64 possible trio genotype configurations for a diallelic SNP

locus, including potential missing data configurations. This number can be reduced to 40

when symmetrical configurations are considered as a single type (e.g., mom AB dad AA =

mom AA dad AB).
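The count of 40 distinct configurations can be checked by enumeration: four genotype states (AA, AB, BB and missing, NN) for each of three individuals give 4^3 = 64 ordered configurations, and treating the two parents as interchangeable leaves 40. The sketch below performs this count; it does not reproduce the class assignments (A to G), which are defined in the Methods.

```python
from itertools import product

GENOTYPES = ("AA", "AB", "BB", "NN")  # NN denotes a missing genotype

# Every ordered (parent 1, parent 2, child) configuration.
ordered = list(product(GENOTYPES, repeat=3))

# Treat the two parents as interchangeable by sorting the parental pair.
unordered = {(tuple(sorted((p1, p2))), child) for p1, p2, child in ordered}

N_ORDERED, N_UNORDERED = len(ordered), len(unordered)
print(N_ORDERED, N_UNORDERED)  # 64 40
```

The 40 arises as 10 unordered parental genotype pairs times 4 possible child genotypes, matching the number of rows in Table S.6.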


References

Fortunato, E. A., Dell’Aquila, M. L., and Spector, D. H. (2000). Specific chromosome 1

breaks induced by human cytomegalovirus. Proc Natl Acad Sci U S A, 97:853–8.

Hoebeeck, J., van der Luijt, R., Poppe, B., De Smet, E., Yigit, N., Claes, K., Zewald, R.,

de Jong, G. J., De Paepe, A., Speleman, F., and Vandesompele, J. (2005). Rapid detection

of VHL exon deletions using real-time quantitative PCR. Lab Invest, 85:24–33.

Lupski, J. R. (1998). Genomic disorders: structural features of the genome can lead to DNA

rearrangements and human disease traits. Trends Genet, 14:417–22.

Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1992). Numerical

Recipes in C: The Art of Scientific Computing. Cambridge University Press, 2nd edition.

Sharp, A. J., Locke, D. P., McGrath, S. D., Cheng, Z., Bailey, J. A., Vallente, R. U., Pertz,

L. M., Clark, R. A., Schwartz, S., Segraves, R., Oseroff, V. V., Albertson, D. G., Pinkel,

D., and Eichler, E. E. (2005). Segmental duplications and copy-number variation in the

human genome. Am J Hum Genet, 77:78–88.

Tuzun, E., Sharp, A. J., Bailey, J. A., Kaul, R., Morrison, V. A., Pertz, L. M., Haugen, E.,

Hayden, H., Albertson, D., Pinkel, D., Olson, M. V., and Eichler, E. E. (2005). Fine-scale

structural variation of the human genome. Nat Genet.
