A High-Resolution Survey of Deletion Polymorphism in
the Human Genome–Supplementary Information
Donald F. Conrad∗, T. Daniel Andrews#, Nigel P. Carter#,
Matthew E. Hurles#, and Jonathan K. Pritchard∗∗
∗Department of Human Genetics, The University of Chicago.
#Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK.
October 5, 2005
This document and the accompanying spreadsheets include additional methods and data
from the paper by Conrad et al.
Additional documents
List of deletions: Included as Excel spreadsheets, Supplementary Table1.xls and Supple-
mentary Table2.xls.
List of CGH validation results: Included as Excel spreadsheet Supplementary Table3.xls.
List of genes involved: Included as Excel spreadsheet Supplementary Table4.xls.
Example CGH plots: Included as pdf document Supplementary Note.pdf.
CGH validation experiments
∗Address for correspondence: Dept of Human Genetics, The University of Chicago, 920 E 58th St–CLSC
507, Chicago IL 60637, USA. Email: [email protected]; Fax: 773: 834 0505; Phone: 773 834 5248.
Figure S.1: Plot of estimated false positive rate against threshold number of probes
within the minimal deleted interval. The noise apparent within log2 ratios for CGH
on oligonucleotide microarrays is such that a single probe provides insufficient information
to reliably detect the presence of a deletion. Therefore we sought to identify the minimum
number of probes needed to reliably infer the presence or absence of a deletion. We reasoned
that if this threshold number of probes was set too low then we should see a high rate of false
negative validations (true deletion classified as a false positive), which would lead to an over-
estimation of the false positive rate of candidate deletions identified from MIs. Consequently,
we decided to increase this threshold number from 1 to 70 and estimate the false positive rate
for each threshold to identify at what threshold value the estimated false positive rate among
113 candidate deletions (with at least 1 unique probe within the minimal deleted interval)
appears to asymptote onto the true false positive rate. We assessed whether a candidate
deletion was a false positive or not by assessing the p value of the Mann-Whitney U test for
each deletion (Methods) against a p value of 0.05 with (red) and without (blue) Bonferroni
correction. Power curves (black) are fitted to these plots of false positive rates. Note that
this analysis was performed prior to manual inspection of the CGH log2 ratio plots, and
so the absolute false positive rates are over-estimates due to the inclusion within the false
positives of true deletions shared with the reference individual (Methods). However, this
sharing of deletions between test and reference individuals is not related to the number of
probes within the minimal deleted interval and so it does not prevent the false positive rate
reaching an asymptote. From this analysis we chose a threshold
[Figure S.2 plot: log2 probe ratio (y-axis, -1.5 to 1.5) against physical position (x-axis, 570000 to 730000 bp). Series: NA18500, NA18860, del1, del2, del3, and 30-period moving averages for NA18500 and NA18860.]
Figure S.2: Three deletion calls represent one deletion event. There are several in-
stances in the validation results where multiple, adjacent deletions predicted in the HapMap
are resolved as a single deletion event with CGH. Here, data from two array CGH exper-
iments are plotted, in one of which three deletions were called within the interval in the
test individual (grey/black). In the other experiment the test individual is undeleted for the
same deletion (blue). Values for individual probes are plotted as well as a moving average
to indicate the noise in the data. Black bars at log2 ratio of 1 indicate the position of the 3
minimally deleted intervals called within this region in the same individual. The physical
position for this locus, on 9pter, is indicated on the x-axis in units of base pairs.
The following figures are a representative sample of the most common outcomes we encoun-
tered during validation.
[Figure S.3 plot: log2 probe ratio (y-axis, -1.5 to 1.5) against physical position (x-axis, 10.04 to 10.13 Mb). Series: flank1, deletion min, flank2, and 30-period moving averages for NA18506, NA18860, NA18857, NA18515, NA18854, NA18521, NA18503 and NA18500.]
Figure S.3: A typical deletion in the test individual. The solid black line at y = 1
indicates the minimum possible length of the deletion based on HapMap data, while the
dashed black lines indicate flanking sequence predicted to not be deleted. Coloured lines
indicate moving average for 8 CGH experiments. The deletion segregating here was predicted
in only one of these 8 individuals, but was found in 3. Locus is on chromosome 1.
[Figure S.4 plot: log2 probe ratio (y-axis, -1.5 to 1.5) against physical position (x-axis, 46.98 to 47.05 Mb). Series: flank1, deletion min, flank2, and 30-period moving averages for NA18506, NA18860, NA18857, NA18515, NA18854, NA18521, NA18503 and NA18500.]
Figure S.4: A typical false positive. Locus is on chromosome 12.
[Figure S.5 plot: log2 probe ratio (y-axis, -1.5 to 1.5) against physical position (x-axis, 208.52 to 208.59 Mb). Series: flank1, deletion min, flank2, and 30-period moving averages for NA18506, NA18860, NA18857, NA18515, NA18854, NA18521, NA18503 and NA18500.]
Figure S.5: True positive masquerading as false positive. The log2 ratio shows no sign
of a deletion in the CGH experiment with the individual in which the deletion was called
(purple line). However, all other 7 CGH experiments exhibit a gain of log2 ratio in the same
region indicating that the reference individual most likely shares the same deletion as the
test individual in which the deletion was identified. Locus is on chromosome 2.
[Figure S.6 plot: log2 probe ratio (y-axis, -1.5 to 2.5) against physical position (x-axis, 199.14 to 199.21 Mb). Series: flank1, deletion min, flank2, and 30-period moving averages for NA18506, NA18860, NA18857, NA18515, NA18854, NA18521, NA18503 and NA18500.]
Figure S.6: Complexity, unsure of call. There is an apparent deletion segregating at the
predicted deletion locus, but a deletion is not apparent in CGH comparison of the individual
NA18500 with the reference (light green) in which it is called. However, the region of the
deletion is covered by segmental duplications, and so we may not expect the probes to
accurately report in a quantitative manner. Also note a very obvious CNP in the flanking
sequence, with both duplications and deletions in this set of 8 test individuals. Locus is on
chromosome 3.
[Figure S.7 plot: log2 probe ratio (y-axis, -1.5 to 1.5) against physical position (x-axis, 11.90 to 12.35 Mb). Series: flank1, deletion min, flank2, and 30-period moving averages for NA18506, NA18860, NA18857, NA18515, NA18854, NA18521, NA18503 and NA18500.]
Figure S.7: Complexity, unsure of call. Highly segmentally duplicated, gap in reference
sequence within tiled region. Clear CNP flanking the gap. Impossible to call whether deletion
exists or not. Locus is on 8p23.
Genomic analyses of deletions
In this section we describe various genome-level analyses that we performed.
Centromeres and telomeres. Due to the enrichment of segmental duplications near
pericentromeric and subtelomeric regions of the genome, a reasonable hypothesis is that
rearrangements may occur more frequently in these regions. We counted the number of
deletions that fell within 1Mb and 2Mb of centromeres and telomeres. The numbers for
each population were not significantly different from a model where deletions are placed at
random on the genome.
Distance      YRI deletions   CEU deletions   Genome sequence
Within 1 Mb   3.3%            1.1%            2.8%
Within 2 Mb   5.8%            5.8%            5.5%
X-chromosome deletions. At first blush, there appears to be a considerable deficit of
X-chromosome loci in our set of deletions. If the average rate at which deletions occur on
the autosomes is the same as the rate of deletion on the X, we would expect to find 8 X-
linked deletions in the YRI and 7 in the CEU, accounting for the reduced power to detect X
chromosome deletions. The observation of just 1 deletion on the non-pseudoautosomal X in
each population may reflect strong negative selection against deletions in males. However, a
careful analysis, which also considers differences in the effective population size between au-
tosomes and the X, as well as differences in the average recombination rate on the autosomes
and the X, will be required to draw substantive conclusions.
Transmission rates by sex of parent. We examined whether there was a sex-specific
bias in the origin of deletion transmissions in the HapMap data. There was a small but
non-significant excess of deletions transmitted from mothers in both populations (51.01%
CEU, 51.52% YRI).
Segmental duplications. It has previously been reported that segmental duplications
are hotbeds for copy number polymorphism (Lupski, 1998; Sharp et al., 2005; Tuzun et al.,
2005). In our samples we find that 16% of the deletions contain or overlap segmental du-
plication regions. A higher proportion of the deletions shared between the CEU and YRI
samples (14/37=38%) lay in segmental duplications. Overall, the overlap with segmental
duplications was much lower than the 55% of variants previously reported in such regions
(Tuzun et al., 2005). This difference may reflect an under-representation of segmental du-
plications by the HapMap. It also seems likely that the mutation mechanism for the smaller
deletions detected in this study will be different from the intermediate- and large-scale CNPs
reportedly affiliated with segmental duplications.
Distribution of functional categories for putative deleted SNPs
Category                CEU deletions   YRI deletions   Entire HapMap
Coding non-synonymous   28              66              10,449
Coding-synonymous       17              37              6,223
Intron                  760             1,423           327,541
Locus-region            183             261             29,976
mRNA-UTR                148             239             54,534
NA                      3,154           4,360           577,183
Nonidentical            22              84              10,650
Splice-site             0               0               60
Total                   4,312           6,470           1,017,246
Table S.1: In order to assess the proportion of genes in deletions relative to the HapMap as a
whole, we identified every SNP that fell within a deletion-compatible region in CEU or YRI.
The rs#s for these SNPs were then used to obtain functional information from dbSNP (build
124). Functional information was also sought for every SNP in the Phase I HapMap. “Locus
region” indicates a SNP within 2 kb of a gene; NA, intergenic SNPs or SNPs for which
functional information was unavailable; nonidentical, SNPs with multiple but non-identical
annotations.
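As a worked illustration of the comparison described in Table S.1, the following Python sketch converts the tabulated counts into within-column percentages. The counts and totals are copied from the table as printed; the dictionary layout and the function name `percentages` are illustrative choices and not part of the original analysis.

```python
# Counts copied from Table S.1: (CEU deletions, YRI deletions, entire HapMap).
COUNTS = {
    "Coding non-synonymous": (28, 66, 10449),
    "Coding-synonymous": (17, 37, 6223),
    "Intron": (760, 1423, 327541),
    "Locus-region": (183, 261, 29976),
    "mRNA-UTR": (148, 239, 54534),
    "NA": (3154, 4360, 577183),
    "Nonidentical": (22, 84, 10650),
    "Splice-site": (0, 0, 60),
}
# Column totals as printed in the table.
TOTALS = (4312, 6470, 1017246)


def percentages(category):
    """Return the percentage of SNPs in `category` for each of the three columns."""
    return tuple(100.0 * n / total for n, total in zip(COUNTS[category], TOTALS))


for name in COUNTS:
    ceu, yri, hapmap = percentages(name)
    print(f"{name:24s} CEU {ceu:6.2f}%  YRI {yri:6.2f}%  HapMap {hapmap:6.2f}%")
```

Each row of the output can then be read as the share of SNPs in a functional category inside deletion-compatible regions, next to the share of that category in the Phase I HapMap as a whole.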
Frequencies of Predicted Deletions
              CEU             YRI             Combined
Frequency     2MI+    1MI+    2MI+    1MI+    2MI+    1MI+
1 copy        167     107     280     137     412     228
2 copies      34      44      74      92      100     126
3 copies      11      23      22      59      30      68
4 copies      12      21      11      31      21      50
5 copies      0       11      6       27      12      30
> 5 copies    4       21      3       46      11      76
Total         228     227     396     392     586     578
Table S.2: The frequency spectrum of a set of polymorphic loci contains information about
the demographic and selective forces acting upon the genome around those loci. By making
the simplifying assumption that overlapping deletion-compatible regions in different individ-
uals represent alleles of the same deletion polymorphism, we constructed a crude frequency
spectrum for deletion loci detected in this study. In order to better approximate the true
frequency spectrum, we did a second analysis, this time including individuals with deletion-
compatible regions that contain only 1 MI as well as the deletion-compatible regions with
2 or more MIs. Occasionally, a 1-MI deletion will overlap two separate 2-MI+ deletion loci;
when this occurs, we collapse both 2-MI+ loci into a single locus. This results in a slight
reduction in the number of loci called in the 1-MI+ analysis. We present the results of both
analyses side-by-side for comparison. These data should be treated with caution since there
will be both false positives and false negatives. Combined: frequencies in the combined CEU
and YRI samples; 1MI+: analysis including 1-MI deletions; 2MI+: analysis excluding 1-MI
deletions.
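The locus-collapsing rule described in the caption of Table S.2 can be sketched as a standard interval merge: deletion-compatible regions that overlap, directly or through a chain of overlaps, are assigned to one locus. This is a minimal sketch under stated assumptions rather than the procedure actually used. The tuple layout `(individual, start, end)` and the function name `collapse_loci` are hypothetical, all regions are assumed to lie on one chromosome, and the coordinates in the usage example are invented.

```python
def collapse_loci(regions):
    """Merge overlapping regions on one chromosome into loci.

    `regions` is a list of (individual, start, end) tuples (hypothetical layout).
    Returns a list of (locus_start, locus_end, carriers), where `carriers` is
    the set of individuals whose regions were merged into that locus.
    """
    loci = []
    for individual, start, end in sorted(regions, key=lambda r: r[1]):
        if loci and start <= loci[-1][1]:
            # Overlaps the current locus: extend it and record the carrier.
            last_start, last_end, carriers = loci[-1]
            loci[-1] = (last_start, max(last_end, end), carriers | {individual})
        else:
            loci.append((start, end, {individual}))
    return loci


# Invented coordinates: the region in "ind3" bridges two otherwise separate
# loci, so all three regions collapse into a single locus.
two_mi = [("ind1", 100, 200), ("ind2", 300, 400)]
one_mi = [("ind3", 150, 350)]

print(len(collapse_loci(two_mi)))           # loci without the 1-MI region
print(len(collapse_loci(two_mi + one_mi)))  # loci once it is included
```

The usage example reproduces the situation noted in the caption, in which adding a 1-MI region reduces the number of loci because it joins two 2-MI+ loci.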
[Figure S.8 plot: one row per chromosome, labelled X and 22 through 1.]
Figure S.8: Genome map of observed deletions. This cartoon depicts the physical
location of each region of the genome found to be hemizygous in at least one individual.
Blue lines indicate the location of a region found in CEU, red lines correspond to YRI. The
centromeres are indicated by purple bars.
[Figure S.9 plot: % deletions - % genome (y-axis, -15.0 to 15.0) for each molecular function category: Molecular function unclassified; Nucleic acid binding; Transcription factor; Transporter; Hydrolase; Signaling molecule; Oxidoreductase; Membrane traffic protein; Chaperone; Miscellaneous function; Isomerase; Lyase; Extracellular matrix; Ligase; Select calcium binding protein; Synthase and synthetase; Cytoskeletal protein; Transfer/carrier protein; Ion channel; Cell junction protein; Transferase; Phosphatase; Select regulatory molecule; Cell adhesion molecule; Kinase; Protease; Defense/immunity protein; Receptor.]
Figure S.9: Analysis of deletion gene molecular function. We identified 267
genes in RefSeq which either overlapped or encompassed a predicted deletion. To in-
vestigate whether these genes were a random sample of genes from the genome, we
used the GeneID associated with each accession to query the PANTHER database at
http://panther.appliedbiosystems.com. These 267 genes fell into 28 molecular function cat-
egories at the highest level of ontological classification. For each category, we calculated the
proportion of all deletion genes that fall within that category, as well as the proportion of
all genes in the genome that fall within that category. The chart displays the difference in
percentage between the two proportions.
[Figure S.10 plot: % deletions - % genome (y-axis, -15 to 20) for each biological process category: Biological process unclassified; Nucleoside, nucleotide and nucleic acid metabolism; Transport; Cell proliferation and differentiation; Cell cycle; Apoptosis; Amino acid metabolism; Developmental processes; Electron transport; Cell structure and motility; Intracellular protein traffic; Other metabolism; Blood circulation and gas exchange; Lipid, fatty acid and steroid metabolism; Oncogenesis; Coenzyme and prosthetic group metabolism; Protein targeting and localization; Carbohydrate metabolism; Muscle contraction; Protein metabolism and modification; Neuronal activities; Cell adhesion; Immunity and defense; Sensory perception; Signal transduction.]
Figure S.10: Analysis of deletion gene biological processes. We identified 267
genes in RefSeq which either overlapped or encompassed a predicted deletion. To in-
vestigate whether these genes were a random sample of genes from the genome, we
used the GeneID associated with each accession to query the PANTHER database at
http://panther.appliedbiosystems.com. These 267 genes fell into 25 biological process cate-
gories at the highest level of ontological classification. For each category, we calculated the
proportion of all deletion genes that fall within that category, as well as the proportion of
all genes in the genome that fall within that category. The chart displays the difference in
percentage between the two proportions.
Validation using Real-Time PCR.
Twelve predicted deletions at 10 loci were selected for validation using real-time PCR. We de-
signed a total of 18 primer pairs that overlapped or flanked SNPs within 10 putative deletion
loci, using Primer Express v2.0 software (ABI). The chosen SNPs represented a mix of both
Mendelian-inconsistent and uninformative SNPs. The primer sequences for our endogenous
control, exon 2 of von Hippel-Lindau 1 (vHL1), were obtained from previously published
work (Hoebeeck et al., 2005). Primers were designed to produce amplicons of approximately
100bp. Genomic DNA for seven CEPH individuals (NA10857, NA12043, NA12044, NA12264,
NA12234, NA10863, NA10851) was obtained from Coriell Cell Repositories (Camden, NJ).
Real-time PCR reactions were run in 50µL volumes, using the Platinum SYBR Green qPCR
Supermix reagent cocktail (Invitrogen) and a dNTP mix (Promega), on an ABI Prism
7900HT (PE Applied-Biosystems). The PCR experiments used an initial ramp-up of 50°C
for 2 minutes, followed by 40 cycles of (95°C for 30 seconds; 57°C for 30 seconds). The reaction
was terminated by a final heat dissociation cycle of 95°C for 15 seconds, 57°C for 15 seconds,
95°C for 15 seconds, to verify the presence of a single PCR product.
For each primer set, each sample was run in duplicate or triplicate for at least 4 different
concentrations of genomic DNA, within the three members of a trio believed to carry a
particular deletion. We used a relative standard curve method to assess each individual’s
genomic copy-number for a given target region, following the manufacturer’s instructions
(Applied Biosystems).
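The general logic of a relative standard curve calculation can be sketched as follows. This is a minimal illustration of the textbook procedure, not the analysis actually run for this study: a line is fitted to threshold cycle (Ct) against log10 input quantity for each primer set, unknown samples are read off that line, and the target quantity is divided by the endogenous control quantity. All function names and all Ct values in the example are hypothetical.

```python
import math


def fit_standard_curve(quantities, cts):
    """Least-squares fit of Ct = slope * log10(quantity) + intercept."""
    xs = [math.log10(q) for q in quantities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cts))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def quantity_from_ct(ct, slope, intercept):
    """Invert the standard curve to get the input quantity for a Ct value."""
    return 10 ** ((ct - intercept) / slope)


def relative_copy_number(ct_target, ct_control, target_curve, control_curve):
    """Target quantity normalized to the endogenous control quantity."""
    return (quantity_from_ct(ct_target, *target_curve)
            / quantity_from_ct(ct_control, *control_curve))


# Invented dilution series with a slope of -3.32 cycles per 10-fold dilution.
dilutions = [100.0, 10.0, 1.0, 0.1]
cts = [20.00, 23.32, 26.64, 29.96]
target_curve = fit_standard_curve(dilutions, cts)
control_curve = fit_standard_curve(dilutions, cts)

# A hemizygous carrier has half the target template, so its target Ct is
# about one cycle later than that of a two-copy non-carrier.
carrier = relative_copy_number(24.32, 23.32, target_curve, control_curve)
non_carrier = relative_copy_number(23.32, 23.32, target_curve, control_curve)
print(round(carrier / non_carrier, 2))
```

Dividing the carrier value by the non-carrier value corresponds to normalizing the non-carrier parent to 1, so that a hemizygous region is expected near 0.5.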
Results are shown in Figure S.11. All ten candidate deletion regions show patterns consistent
with deletion transmission from parent to child.
Validation regions examined by real-time PCR
Chromosome   Start         Length    SNPs   MIs   PCRs   Region
1            146,393,913   92,766    3      2     1      7
1            146,510,265   10,492    3      1     1      8
3            150,282,950   11,424    2      2     2      1
4            169,491,071   203,747   56     3     3      9
6            10,574,024    46,284    17     4     3      2
8            14,841,542    230,079   27     10    1      3
8            15,087,195    130,346   19     12    2      4
8            15,271,799    413,183   56     9     1      5
8            104,088,748   1,061     2      2     1      6
20           14,798,454    18,773    6      4     3      10
Table S.3: Summary of 10 genomic regions used in validation. Start, beginning of deletion-
compatible region in base pairs (NCBI 34); Length, length of deletion-compatible region;
SNPs, number of SNPs within region; MIs, number of Mendelian incompatibilities detected
within region; PCRs, number of non-overlapping primer sets tested within region; Region,
number used to label PCRs in Figure S.11. The region with just one MI was part of a
multiple-MI region in an earlier HapMap build, when the experiment was designed. The
regions described here were selected on the basis of an early build of the HapMap (August
2004), so start and stop positions may not align precisely to a deletion identified in the final
analysis.
Figure S.11: Validation of predicted deletions. Eighteen regions from twelve predicted dele-
tions were examined using quantitative PCR. The relative quantity of DNA in the predicted
deletion carriers (red points) was consistently lower than in the non-carrier parent for each
trio (blue points). The quantitative values for each region are normalized by setting the
blue points to 1. The dotted line indicates the 50% level predicted for hemizygous regions.
The numbers floating above each point can be used to match an experimental result to its
genomic region in Table S.3.
Deletion filtering to remove false positives.
As described in the Methods, we performed filtering to remove candidate deletions that may
not be transmitted deletions. That is, we discarded deletion regions that (i) contained one
or more Type II MIs, as these can indicate a somatic deletion in a parent; (ii) occurred
in unusual clusters of deletions in a trio that was an outlier in terms of total number of
deletions; (iii) occurred in known regions of somatic recombination.
In order to illustrate the nature of these changes, we provide the following summary of
candidate deletion regions removed.
(i) One CEU trio, part of family 1459, was an extreme outlier, with 91 deletion regions,
compared to the CEU average of 15. The cell line for individual CEPH1459.11, the father of
this trio, has apparently ejected a substantial portion of chromosome 1q (approximately
104 Mb), creating a large region with high rates of both Type I and Type II MIs (such a
pattern is expected for a chromosome that is transmitted to the child, and then subsequently
lost in the parent). The breakpoint lies within 1q21, which is a noted hotspot for chromosome
breaks in cells infected by human cytomegalovirus (a close relative of Epstein-Barr virus)
(Fortunato et al., 2000), and the break may therefore have occurred during the transformation
process. Not including the chromosome 1q ejection or deletion regions removed due to other
considerations, there were 8 deletions in YRI and 3 deletions in CEU that were compatible
with cell-line specific deletion.
(ii) We identified 3 YRI trios that appeared as outliers in terms of either their total number
of deletions or the number of deletions manifest on a single chromosome. Trios 17 and 45
both had abnormally large numbers of MIs and deletion regions on chromosome 12. We were
able to confirm with the genotyping center responsible for the aberrant genotypes that sample
mix-ups had occurred with both trios (Martin Moorhead, personal communication). We removed
all candidate deletion regions from these two trios on chromosome 12: 22 regions from trio 17
and 20 regions from trio 45. YRI trio 23 was an outlier on chromosome 10, with 32 deletion
regions between 13.4 Mb and 26.95 Mb, a density comparable to the chromosome 1q ejection.
The MIs producing these deletions are all of category “A” (see Methods), with relatively
few Type II MIs (which are diagnostic of a cell-line deletion). There was not a noticeable
elevation in homozygosity for any member of the trio in this region (data not shown). The
responsible genotyping center has identified the single experiment in which aberrant genotypes
in this region of chromosome 10 were generated for the mother of this trio (NA18855), and has
requested that these data be removed from future HapMap releases (Sarah Hunt, personal
communication).
(iii) Deletion regions recorded in the following regions of known somatic recombination
were discarded:
Region          Chromosome   Approx. start (Mb, NCBI 34)   Approx. stop (Mb, NCBI 34)
IGK             2            89.00                         89.50
TCRG            7            37.92                         38.10
TCRB            7            141.60                        141.96
TCRA and TCRD   14           20.55                         21.01
IGH             14           104.00                        105.00
IGL             22           20.70                         22.00
In total, 26 CEU and 6 YRI deletions were discarded that fell in class (iii) but no other class.
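Filter (iii) amounts to an interval-overlap check against the listed regions, which can be sketched as below. The overlap criterion (any overlap leads to removal), the function name `overlapping_region`, and the candidate coordinates in the usage lines are assumptions made for illustration. IGL is placed on chromosome 22 here, where the immunoglobulin lambda locus maps.

```python
# Regions of known somatic recombination
# (chromosome, approximate start and stop in Mb, NCBI 34).
SOMATIC_REGIONS = {
    "IGK": ("2", 89.00, 89.50),
    "TCRG": ("7", 37.92, 38.10),
    "TCRB": ("7", 141.60, 141.96),
    "TCRA and TCRD": ("14", 20.55, 21.01),
    "IGH": ("14", 104.00, 105.00),
    "IGL": ("22", 20.70, 22.00),  # assumption: chromosome 22
}


def overlapping_region(chrom, start_mb, stop_mb):
    """Return the name of the first listed region that overlaps, else None."""
    for name, (region_chrom, region_start, region_stop) in SOMATIC_REGIONS.items():
        if chrom == region_chrom and start_mb <= region_stop and stop_mb >= region_start:
            return name
    return None


# Invented candidate regions: the first falls inside IGH, the second does not
# overlap any listed region.
print(overlapping_region("14", 104.2, 104.3))
print(overlapping_region("8", 14.84, 15.07))
```

A candidate deletion region for which the function returns a region name would be discarded under class (iii).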
MCMC implementation.
In this section, we describe the MCMC algorithm that was used to generate some of the
statistical results in the main text. The primary quantities of interest are the total number
of haploid deletions transmitted to the HapMap offspring (observed and unobserved), and
the true lengths of the deletions that were observed. The data used in this method consist
of the following:
$N = \{n_1, \ldots, n_{23}\}$, where $n_{23}$ is the number of deletions detected on the X chromosome and
$n_i$ is the number of deletions detected on chromosome $i$ for all other $i$.
$M = \{m_1, \ldots, m_J\}$, where $m_j = (\min_j, \max_j)$ gives the smallest and largest possible size
for observed deletion $j$, and $J = \sum_{i=1}^{23} n_i$. The physical distance between the first deletion-
incompatible SNPs distal and proximal to the deletion defines $\max_j$, and the distance between
the most distal and proximal MI within the deletion-compatible region defines $\min_j$.
Additional information used in the analyses is
$B = \{b_1, \ldots, b_{23}\}$, the sizes (bp) of the chromosomes, and $p_i(l)$, the power to detect a deletion
of size $l$ on chromosome $i$. This latter function was estimated by simulation as described in
the Methods section of the main text.
Model. We created a formal hierarchical model to relate the number and length of observed
deletions to the total number and length distribution of deletions in both the CEU and YRI
samples. In this model the true length of the $j$th detected deletion is $l_j$, and $L = \{l_1, \ldots, l_J\}$.
The frequency of a deletion of length $l$, denoted $f(l)$, follows a gamma distribution with
shape parameter $\alpha$ and scale parameter $\beta$. The total number of deletions in the sample on
the $i$th chromosome is $t_i$, with $T = \{t_1, \ldots, t_{23}\}$. The $t_i$ are Poisson distributed with rate $30cb_i$,
independently for all $i$. The number of deletions observed in the HapMap on chromosome $i$, $n_i$, is
distributed
$$n_i \sim \mathrm{Bin}(\lambda_i, t_i)$$
where
$$\lambda_i = \int_0^{\infty} p_i(l)\,\mathrm{Ga}(l; \alpha, \beta)\,dl$$
which is the power to detect a deletion on chromosome $i$, unconditional on its length. We use
the Gauss-Kronrod 21-point integration rule to numerically evaluate $\lambda_i$ (Press et al., 1992).
We use flat priors on the shape parameter $\alpha$, the scale parameter $\beta$, and on $c$, which determines
the total number of deletions in the sample. The prior distribution of deletion sizes (i.e., unconditional
on detection) will be $\mathrm{Ga}(\alpha, \beta)$.
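The integral defining the unconditional power can be approximated numerically, as in the sketch below. Two substitutions are made and should be read as assumptions: composite Simpson's rule on a truncated range stands in for the Gauss-Kronrod 21-point rule named in the text, and `detection_power` is a hypothetical smooth curve standing in for the power function that was estimated by simulation. The function names are illustrative.

```python
import math


def gamma_pdf(l, alpha, beta):
    """Gamma density with shape alpha and scale beta, evaluated at length l."""
    if l <= 0:
        return 0.0  # boundary value; exact for alpha > 1
    log_pdf = ((alpha - 1) * math.log(l) - l / beta
               - math.lgamma(alpha) - alpha * math.log(beta))
    return math.exp(log_pdf)


def detection_power(l, length_scale=2.0e4):
    """Hypothetical stand-in for p_i(l); the real curve comes from simulation."""
    return 1.0 - math.exp(-l / length_scale)


def unconditional_power(alpha, beta, power=detection_power, intervals=20000):
    """Approximate the integral of power(l) * Ga(l; alpha, beta) over l > 0."""
    # Truncate the infinite range 20 standard deviations above the gamma mean.
    upper = beta * (alpha + 20.0 * math.sqrt(alpha))
    h = upper / intervals  # `intervals` must be even for Simpson's rule
    total = 0.0
    for k in range(intervals + 1):
        l = k * h
        weight = 1 if k in (0, intervals) else (4 if k % 2 else 2)
        total += weight * power(l) * gamma_pdf(l, alpha, beta)
    return total * h / 3.0


print(unconditional_power(alpha=2.0, beta=2.0e4))
```

For this particular stand-in power curve the integral has the closed form $1 - (1 + \beta/s)^{-\alpha}$, where $s$ is the `length_scale` argument, which gives a convenient check on the numerical result.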
We now wish to sample from $\Pr(c, T, L, \alpha, \beta \mid N, M)$. This is done using the following
Metropolis-Hastings algorithm.
Initialize all parameters with plausible starting values $\alpha^{(0)}, \beta^{(0)}, c^{(0)}, T^{(0)}, L^{(0)}$. Then update
in turn as follows:
Step 1. Propose $c' \sim N(c, \sigma_1^2)$. If $c' \in (0, \max)$ then accept $c'$ if a uniform random variable
$u$ satisfies
$$u < \prod_{i=1}^{23} \frac{\mathrm{Po}(t_i, 30c'b_i)}{\mathrm{Po}(t_i, 30cb_i)} = \prod_{i=1}^{23} \left(\frac{c'}{c}\right)^{t_i} \exp\bigl(30b_i(c - c')\bigr)$$
Step 2. For each $t_i$, propose $t'_i$ from a symmetric discrete distribution centered on $t_i$. If
$t'_i \geq n_i$ then accept $t'_i$ if
$$u < \frac{\mathrm{Po}(t'_i, 30cb_i)}{\mathrm{Po}(t_i, 30cb_i)} \cdot \frac{\mathrm{Bin}(n_i; \lambda_i, t'_i)}{\mathrm{Bin}(n_i; \lambda_i, t_i)} = (30cb_i)^{(t'_i - t_i)}\,\frac{(t_i - n_i)!}{(t'_i - n_i)!}\,(1 - \lambda_i)^{(t'_i - t_i)}$$
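The closed form of the Step 2 acceptance ratio follows from writing out the Poisson and binomial probability mass functions for a proposed value $t'_i$ and the current value $t_i$. The derivation below is a worked check, with $\mu_i = 30cb_i$ introduced as shorthand:

```latex
\begin{align*}
\frac{\mathrm{Po}(t'_i, \mu_i)}{\mathrm{Po}(t_i, \mu_i)}
  &= \frac{\mu_i^{t'_i} e^{-\mu_i} / t'_i!}{\mu_i^{t_i} e^{-\mu_i} / t_i!}
   = \mu_i^{\,t'_i - t_i}\,\frac{t_i!}{t'_i!},
   \qquad \mu_i = 30cb_i, \\
\frac{\mathrm{Bin}(n_i; \lambda_i, t'_i)}{\mathrm{Bin}(n_i; \lambda_i, t_i)}
  &= \frac{\binom{t'_i}{n_i}\,\lambda_i^{n_i}\,(1 - \lambda_i)^{t'_i - n_i}}
          {\binom{t_i}{n_i}\,\lambda_i^{n_i}\,(1 - \lambda_i)^{t_i - n_i}}
   = \frac{t'_i!}{t_i!}\cdot\frac{(t_i - n_i)!}{(t'_i - n_i)!}\,(1 - \lambda_i)^{t'_i - t_i}, \\
\text{product}
  &= (30cb_i)^{t'_i - t_i}\,\frac{(t_i - n_i)!}{(t'_i - n_i)!}\,(1 - \lambda_i)^{t'_i - t_i}.
\end{align*}
```

The factors $t_i!/t'_i!$ and $t'_i!/t_i!$ cancel, and the binomial coefficient is zero whenever $t'_i < n_i$, which is why such proposals are never accepted.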
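Step 3. Update each $l_j$ for the detected deletions, in turn, as follows. Simulate $l'_j \sim
\mathrm{Uni}(\min_j, \max_j)$. Accept $l'_j$ if
$$u < \frac{\mathrm{Ga}(l'_j; \alpha, \beta)\,p_i(l'_j)}{\mathrm{Ga}(l_j; \alpha, \beta)\,p_i(l_j)} = \left(\frac{l'_j}{l_j}\right)^{\alpha - 1} e^{(l_j - l'_j)/\beta}\,\frac{p_i(l'_j)}{p_i(l_j)}$$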
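Step 4. Propose $\alpha' \sim N(\alpha, \sigma_2^2)$. If $\alpha' \in (0, \max_\alpha)$ then accept $\alpha'$ if
$$u < \prod_{j=1}^{J} \frac{\Pr(l_j \mid \alpha', \beta, \text{detected})}{\Pr(l_j \mid \alpha, \beta, \text{detected})}$$
where
$$\Pr(l_j \mid \alpha, \beta, \text{detected}) = \frac{\frac{1}{\Gamma(\alpha)\beta^{\alpha}}\, l_j^{\alpha - 1} e^{-l_j/\beta}\, p_i(l_j)}{\lambda_i(\alpha, \beta)}$$
and $i$ is the chromosome on which deletion $j$ was detected.
Step 5. Propose $\beta' \sim N(\beta, \sigma_3^2)$. If $\beta' \in (0, \max_\beta)$ then accept $\beta'$ if
$$u < \prod_{j=1}^{J} \frac{\Pr(l_j \mid \alpha, \beta', \text{detected})}{\Pr(l_j \mid \alpha, \beta, \text{detected})}$$
where $\Pr(l_j \mid \alpha, \beta, \text{detected})$ is as defined in Step 4.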
We mix over possible deletion lengths by choosing $l_j$ uniformly at random over the acceptable
interval (from the minimum possible size of the deletion to the maximum possible size).
In practice, we collect 10,000 samples of $\alpha^{(m)}, \beta^{(m)}, c^{(m)}, T^{(m)}, L^{(m)}$ from the Markov chain
once it has reached stationarity.
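To make the update scheme concrete, the sketch below implements only Steps 1 and 2 (the updates of $c$ and of each $t_i$) on a toy data set. It is a simplified illustration under stated assumptions, not the implementation used for the reported results: the detection probabilities $\lambda_i$ are held fixed, Steps 3 to 5 are omitted, the symmetric discrete proposal is taken to be a step of plus or minus one, and every numerical value in the usage lines is invented. Acceptance ratios are evaluated on the log scale for numerical stability.

```python
import math
import random


def log_accept_c(c_new, c_old, t, b, trios=30):
    """Log of the Step 1 ratio: prod_i (c'/c)^t_i * exp(trios * b_i * (c - c'))."""
    return sum(ti * math.log(c_new / c_old) + trios * bi * (c_old - c_new)
               for ti, bi in zip(t, b))


def log_accept_t(t_new, t_old, n, lam, mu):
    """Log of the Step 2 ratio for one chromosome, with mu = trios * c * b_i."""
    return ((t_new - t_old) * (math.log(mu) + math.log(1.0 - lam))
            + math.lgamma(t_old - n + 1) - math.lgamma(t_new - n + 1))


def run_chain(n, b, lam, c0, iterations, sd_c, c_max, rng, trios=30):
    """Alternate Step 1 and Step 2 updates; lambda_i is held fixed here."""
    c, t = c0, list(n)  # start with t_i = n_i (no unobserved deletions)
    samples = []
    for _ in range(iterations):
        # Step 1: Gaussian random-walk proposal for c, rejected outside (0, c_max).
        c_new = rng.gauss(c, sd_c)
        if 0.0 < c_new < c_max:
            log_u = math.log(1.0 - rng.random())  # log of a uniform on (0, 1]
            if log_u < log_accept_c(c_new, c, t, b, trios):
                c = c_new
        # Step 2: symmetric +/-1 proposal for each t_i, rejected if t'_i < n_i.
        for i in range(len(t)):
            t_new = t[i] + rng.choice((-1, 1))
            if t_new >= n[i]:
                mu = trios * c * b[i]
                log_u = math.log(1.0 - rng.random())
                if log_u < log_accept_t(t_new, t[i], n[i], lam[i], mu):
                    t[i] = t_new
        samples.append((c, tuple(t)))
    return samples


# Toy inputs (all invented): three chromosomes with observed counts n,
# sizes b in bp, and fixed detection probabilities lam.
n = [5, 3, 1]
b = [2.4e8, 1.5e8, 0.5e8]
lam = [0.2, 0.2, 0.1]
samples = run_chain(n, b, lam, c0=1e-9, iterations=500,
                    sd_c=5e-10, c_max=1e-6, rng=random.Random(1))
print(len(samples))
```

In a full implementation the early part of the chain would be discarded as burn-in, and $\lambda_i$ would be recomputed whenever $\alpha$ or $\beta$ changes.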
The method was tested using simulated deletions from the HapMap data, and was found to
perform well at estimating c and L for a wide range of α and β (Figure S.12). As α approaches
0, however, the estimates of c and L become increasingly biased towards smaller values; we
therefore suspect that the values of L and c reported in this paper are underestimates.
[Figure S.12 plot: estimated length (kb, y-axis, 0 to 800) against true length (kb, x-axis, 0 to 200).]
Figure S.12: Estimation accuracy of deletion lengths using our MCMC algorithm. Results
of analysis of simulated dataset based on CEU chromosome 2. The true length of each
simulated deletion is plotted against an estimation of the length by (a), using the maximum
possible length, red points; (b) the minimum possible length, which is the distance between
the most distal MI and most proximal MI within the deletion-compatible region, blue points;
(c) length estimated using MCMC, black points. Points from a completely accurate method
would fall along the black line. Three thousand deletions were simulated with sizes drawn
from a Gamma distribution with parameters α = 2.0 and β = 2 × 10^4.
Estimated Rates of CEU Trio Configurations in Phase I HapMap
Center Platform A B C D E F G Total
perlegen perlegen 0.00018(±5.1 × 10−5) 0.00017(±4.9 × 10−5) 5.6 × 10−5(±2.8 × 10−5) 0.67(±0.0031) 0.055(±0.00088) 0.054(±0.00087) 0.22(±0.0018) 70980
broad illumina 0.00026(±1.2 × 10−5) 0.00028(±1.2 × 10−5) 0.00045(±1.5 × 10−5) 0.61(±0.00056) 0.066(±0.00019) 0.066(±0.00019) 0.26(±0.00037) 1927348
broad sequenom/affy 0.00042(±1.2 × 10−5) 0.00039(±1.1 × 10−5) 0.00019(±7.8 × 10−6) 0.5(±0.0004) 0.086(±0.00017) 0.085(±0.00017) 0.33(±0.00032) 3090378
chmc illumina 0.00032(±1.3 × 10−5) 0.00032(±1.3 × 10−5) 0.00022(±1.1 × 10−5) 0.52(±0.00052) 0.081(±0.00021) 0.08(±0.0002) 0.32(±0.00041) 1921888
chmc sequenom 0.00063(±2.7 × 10−5) 0.00057(±2.5 × 10−5) 0.00036(±2 × 10−5) 0.52(±0.00077) 0.086(±0.00031) 0.085(±0.00031) 0.31(±0.00059) 884338
illumina illumina 9.1 × 10−5(±4 × 10−6) 9.5 × 10−5(±4.1 × 10−6) 7.1 × 10−5(±3.5 × 10−6) 0.47(±0.00029) 0.087(±0.00012) 0.086(±0.00012) 0.35(±0.00025) 5805924
imsut-riken invader 1.4 × 10−5(±1.5 × 10−6) 1.4 × 10−5(±1.5 × 10−6) 0(±0) 0.56(±0.0003) 0.074(±0.00011) 0.076(±0.00011) 0.29(±0.00022) 6361818
mcgill-gqic illumina 6.3 × 10−5(±4.6 × 10−6) 5.2 × 10−5(±4.2 × 10−6) 3.1 × 10−5(±3.3 × 10−6) 0.54(±0.00043) 0.078(±0.00016) 0.077(±0.00016) 0.31(±0.00032) 2906610
bcm mip 0.00053(±1.9 × 10−5) 0.00059(±2 × 10−5) 0.00089(±2.5 × 10−5) 0.54(±0.00062) 0.077(±0.00023) 0.078(±0.00023) 0.3(±0.00046) 1411300
sanger illumina 0.00054(±9 × 10−6) 0.00058(±9.3 × 10−6) 0.0014(±1.4 × 10−5) 0.55(±0.00029) 0.075(±0.00011) 0.076(±0.00011) 0.3(±0.00021) 6671012
ucsf-wu PE FP 0.00031(±2.9 × 10−5) 0.00029(±2.8 × 10−5) 0.00021(±2.4 × 10−5) 0.49(±0.0012) 0.088(±0.0005) 0.086(±0.00049) 0.34(±0.00097) 359212
Table S.4: We present here the rate matrix of autosomal CEU trio configurations used in our simulations. For each combination of
genotyping center and genotyping platform, we have estimated the frequency of each configuration (i.e. A-G; see Methods) considered
by our deletion-detection algorithm. The standard error follows in parentheses. Each center-platform combination has been considered
separately to accommodate inter-platform and inter-operator variation in these rates.
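The caption does not state how the standard errors were computed. As a hedged observation, the printed values appear consistent with $\sqrt{\text{rate}/\text{total}}$, which is what treating each configuration count as Poisson would give; the binomial form $\sqrt{p(1-p)/n}$ would instead give about 0.0018 for configuration D in the perlegen row, where 0.0031 is printed. The sketch below checks this reading against four entries of that row. The function name is illustrative, and the formula is an inference from the numbers rather than a statement of the method used.

```python
import math


def poisson_style_se(rate, total):
    """Standard error sqrt(rate / total), equivalent to sqrt(count) / total."""
    return math.sqrt(rate / total)


# (rate, total, standard error as printed) for the perlegen row of Table S.4.
PERLEGEN = {
    "A": (0.00018, 70980, 5.1e-5),
    "D": (0.67, 70980, 0.0031),
    "E": (0.055, 70980, 0.00088),
    "G": (0.22, 70980, 0.0018),
}

for config, (rate, total, printed) in PERLEGEN.items():
    print(config, f"{poisson_style_se(rate, total):.2g}", printed)
```

Small discrepancies are expected because the tabulated rates are themselves rounded to two significant figures.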
Estimated Rates of YRI Trio Configurations in Phase I HapMap
Center Platform A B C D E F G Total
broad illumina 0.00041(±1.5 × 10−5) 0.0004(±1.5 × 10−5) 0.00049(±1.6 × 10−5) 0.55(±0.00054) 0.076(±0.0002) 0.078(±0.0002) 0.29(±0.00039) 1910458
broad sequenom/affy 0.0016(±2.3 × 10−5) 0.0015(±2.3 × 10−5) 0.00039(±1.2 × 10−5) 0.51(±0.00042) 0.087(±0.00017) 0.087(±0.00017) 0.31(±0.00032) 2917710
chmc illumina 0.00042(±1.4 × 10−5) 0.00043(±1.4 × 10−5) 0.00015(±8.2 × 10−6) 0.48(±0.00047) 0.088(±0.0002) 0.088(±0.0002) 0.34(±0.0004) 2195580
chmc sequenom 0.00093(±4.3 × 10−5) 0.00097(±4.4 × 10−5) 0.00034(±2.6 × 10−5) 0.51(±0.001) 0.088(±0.00042) 0.085(±0.00041) 0.31(±0.00079) 502676
illumina illumina 6.6 × 10−5(±3.4 × 10−6) 5.6 × 10−5(±3.1 × 10−6) 2.7 × 10−5(±2.2 × 10−6) 0.48(±0.00029) 0.089(±0.00012) 0.089(±0.00012) 0.35(±0.00024) 5809386
imsut-riken invader 0.00012(±4.4 × 10−6) 0.00012(±4.5 × 10−6) 6.8 × 10−7(±3.4 × 10−7) 0.53(±0.0003) 0.082(±0.00012) 0.081(±0.00012) 0.31(±0.00023) 5923486
mcgill-gqic illumina 0.00024(±9.2 × 10−6) 0.00024(±9.1 × 10−6) 3.5 × 10−5(±3.5 × 10−6) 0.5(±0.00042) 0.084(±0.00017) 0.085(±0.00017) 0.33(±0.00034) 2846328
bcm mip 0.0013(±3.1 × 10−5) 0.0014(±3.2 × 10−5) 0.0014(±3.1 × 10−5) 0.52(±0.00062) 0.081(±0.00024) 0.082(±0.00024) 0.31(±0.00048) 1367738
sanger illumina 0.00092(±1.1 × 10−5) 0.00083(±1.1 × 10−5) 0.0017(±1.6 × 10−5) 0.52(±0.00027) 0.081(±0.00011) 0.081(±0.00011) 0.31(±0.00021) 6974062
ucsf-wu PE FP 0.00053(±4.1 × 10−5) 0.00043(±3.7 × 10−5) 0.00046(±3.8 × 10−5) 0.5(±0.0013) 0.09(±0.00054) 0.089(±0.00053) 0.32(±0.001) 314220
Table S.5: We present here the rate matrix of autosomal YRI trio configurations used in our simulations. For each combination of
genotyping center and genotyping platform, we have estimated the frequency of each configuration (i.e. A-G; see Methods) considered
by our deletion-detection algorithm. The standard error for each rate follows in parentheses. Each center-platform combination has
been considered separately to accommodate inter-platform and inter-operator variation in these rates.
Enumeration of Trio Configurations
Parent 1 Parent 2 Child Class
AA AA AA D
AA AA AB G
AA AA BB C
AA AA NN D
AA AB AB G
AA AB AA E/F
AA AB BB A/B
AA AB NN D
AA NN AA D
AA NN AB G
AA NN BB A/B
AA NN NN D
NN NN AA D
NN NN AB G
NN NN BB D
NN NN NN D
BB BB AA C
BB BB AB G
BB BB BB D
BB BB NN D
BB AA AA A/B
BB AA AB G
BB AA BB A/B
BB AA NN D
BB AB AA A/B
BB AB AB G
BB AB BB E/F
BB AB NN D
BB NN AA C
BB NN AB G
BB NN BB D
BB NN NN D
AB AB AB G
AB AB AA G
AB AB BB G
AB AB NN G
AB NN AB G
AB NN AA E/F
AB NN BB E/F
AB NN NN D
Table S.6: Here we present in full the coding scheme used in our deletion-detection algorithm
(See Methods). There are $4^3 = 64$ total possible trio genotype configurations for a diallelic SNP
locus, including potential missing data configurations. This number can be reduced to 40
when symmetrical configurations are considered as a single type (e.g., mom AB dad AA =
mom AA dad AB).
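The counts in the caption can be checked by direct enumeration: four genotype codes (AA, AB, BB and the missing-data code NN) for each of three trio members give $4^3 = 64$ ordered configurations, and treating the two parents as interchangeable leaves 40. The short sketch below performs that enumeration; the variable names are illustrative.

```python
from itertools import combinations_with_replacement, product

GENOTYPES = ("AA", "AB", "BB", "NN")  # NN denotes a missing genotype

# Ordered configurations: (parent 1, parent 2, child).
ordered = list(product(GENOTYPES, repeat=3))

# Unordered parental pairs combined with each possible child genotype.
unordered = [(p1, p2, child)
             for p1, p2 in combinations_with_replacement(GENOTYPES, 2)
             for child in GENOTYPES]

print(len(ordered), len(unordered))  # 64 40
```

The 40 unordered configurations correspond to the 40 rows of Table S.6.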
References
Fortunato, E. A., Dell’Aquila, M. L., and Spector, D. H. (2000). Specific chromosome 1
breaks induced by human cytomegalovirus. Proc Natl Acad Sci U S A, 97:853–8.
Hoebeeck, J., van der Luijt, R., Poppe, B., De Smet, E., Yigit, N., Claes, K., Zewald, R.,
de Jong, G. J., De Paepe, A., Speleman, F., and Vandesompele, J. (2005). Rapid detection
of VHL exon deletions using real-time quantitative PCR. Lab Invest, 85:24–33.
Lupski, J. R. (1998). Genomic disorders: structural features of the genome can lead to DNA
rearrangements and human disease traits. Trends Genet, 14:417–22.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1992). Numerical
Recipes in C: The Art of Scientific Computing. Cambridge University Press, 2nd edition.
Sharp, A. J., Locke, D. P., McGrath, S. D., Cheng, Z., Bailey, J. A., Vallente, R. U., Pertz,
L. M., Clark, R. A., Schwartz, S., Segraves, R., Oseroff, V. V., Albertson, D. G., Pinkel,
D., and Eichler, E. E. (2005). Segmental duplications and copy-number variation in the
human genome. Am J Hum Genet, 77:78–88.
Tuzun, E., Sharp, A. J., Bailey, J. A., Kaul, R., Morrison, V. A., Pertz, L. M., Haugen, E.,
Hayden, H., Albertson, D., Pinkel, D., Olson, M. V., and Eichler, E. E. (2005). Fine-scale
structural variation of the human genome. Nat Genet.