Genes and MS in Tasmania, completed.
Lecture 7, Statistics 246, February 12, 2004
Towards a sharing statistic
Our aim was to come up with a statistic that effectively describes haplotype sharing differences between case and “control” haplotypes.
The sharing statistic should be largest at markers closest to a disease locus, as there
- haplotype sharing should extend the furthest; and
- the association of disease with particular haplotypes should be strongest.
Nonparametric haplotype sharing analysis
Why nonparametric, rather than likelihood-based methods?
• Likelihood methods make assumptions regarding the genealogy of the population, and we don’t know how many of these assumptions are robust to violations.
• Likelihood methods are computationally intensive, especially for genome wide scans, where there is a need to maximize over the very large state space of possible ancestral haplotypes (MCMC)
• Likelihood methods have a hard time at the HLA region, because the LD there is extremely high and non uniform (block-like structure)
• Simpler statistics will probably do better here, unless we can model background LD
Haplotype sharing statistics for genome wide scan data cf. fine mapping
• Previous (usually likelihood-based) statistics have concentrated on fine mapping and the exact localization of a variant allele. They assume a signal exists.
• For us, localization was not the primary interest. Rather, detection was our main interest, using a genome-wide scan
• We needed something that was not as computationally intensive as DHSMAP (McPeek & Strahs, 1999), BLADE (Liu et al, 2001), DMLE+ (Rannala & Reeve, 2001), or the shattered coalescent (Morris et al, 2002).
Haplo_clusters (Melanie Bahlo)
• Calculates a sharing statistic at every marker
• Obtains a p-value at every marker using a permutation test
• Allows for several clusters of ancestral haplotypes (allelic heterogeneity)
Testing for shared haplotypes
Cases
3 5 9 8 7 6 10 1 5 4 3 2 5
3 2 1 3 7 6 10 1 5 4 1 3 2
1 2 1 3 5 6 10 1 5 2 1 3 4
2 3 7 3 1 6 10 9 1 1 2 5 6
Controls
5 9 1 1 4 1 3 1 2 3 1 9 8
7 6 5 3 1 3 2 1 5 9 7 9 1
7 1 2 1 1 3 5 7 1 5 1 3 2
9 3 9 2 1 2 7 5 3 4 2 2 5
[Figure: score for haplotype sharing (-log p) along the chromosome, from pter to qter]
Sharing drop-off & allelic heterogeneity
[Figure: for each of markers 1 to 4, the proportions of case haplotypes and of control haplotypes that are cluster 1 haplotypes, cluster 2 haplotypes, or neither cluster 1 nor 2 haplotypes]
Haplo_cluster in action
Example: Sorting on marker 1 for a sample of 3 case and 4 control haplotypes
Cases
2 1 3
2 1 3
2 1 4
Controls
1 1 2
1 2 3
1 3 3
3 1 2
After sort on haplotype consisting only of marker 1, calculate a chi-square statistic, and move on
Haplotype  1  2  3
Controls   3  0  1
Cases      0  3  0
After sorting on haplotype consisting of marker 1 and marker 2, calculate a chi-square statistic, and ….
Haplotype  1 1  1 2  1 3  2 1  3 1
Controls   1    1    1    0    1
Cases      0    0    0    3    0
Eventually stop, and sum the chi-square statistics. Then repeat for a suitably large number of random permutations of cases and controls.
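The first two steps of this example can be reproduced with a short sketch. It assumes a 1-d.f. Pearson chi-square without continuity correction, and it takes the "ancestral" haplotype at each step to be the most common one among the cases. Both choices, and the function names `chi2_2x2` and `step_chi2`, are illustrative assumptions rather than a description of what Haplo_clusters itself does.

```python
from collections import Counter


def chi2_2x2(a, b, c, d):
    """Pearson chi-square (1 d.f., no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # a margin is empty, so the table carries no information
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def step_chi2(cases, controls, n_markers):
    """Chi-square for sharing the candidate haplotype over the first n_markers markers."""
    case_counts = Counter(h[:n_markers] for h in cases)
    control_counts = Counter(h[:n_markers] for h in controls)
    # Assumption: the candidate ancestral haplotype is the commonest among cases.
    ancestral, a = case_counts.most_common(1)[0]
    c = control_counts[ancestral]
    # Rows: cases, controls. Columns: shares the haplotype, does not share it.
    return ancestral, chi2_2x2(a, len(cases) - a, c, len(controls) - c)


# The 3 case and 4 control haplotypes of the slide example (3 markers each).
cases = [(2, 1, 3), (2, 1, 3), (2, 1, 4)]
controls = [(1, 1, 2), (1, 2, 3), (1, 3, 3), (3, 1, 2)]

steps = [step_chi2(cases, controls, m) for m in (1, 2)]
total = sum(chi2 for _, chi2 in steps)
```

For the example data both steps give a chi-square of 7, so the running sum after two markers is 14.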
Statistic to evaluate haplotype sharing
The sharing statistic is χ²-based, using the idea of multiple ancestral haplotypes (clusters) which are “grown” starting at each marker examined in the scan.
Significance is evaluated via a permutation test: choose a random permutation of the pooled cases and controls, and recalculate the statistic; repeat ~20,000 times.
A recursive form for the estimator and the SD of the p-value was used, to enable early termination of the program.
$$S_i = \sum_{j=1}^{K} \sum_{k=1}^{K} \chi^2_{i,j,k}$$
$\chi^2_{i,j,k}$ = $\chi^2_1$ test for association between the number of case and control haplotypes still sharing the ancestral haplotype of cluster $k$ at marker $j$, after starting at marker $i$.
The permutation test
The idea is this. We have 170 cases and 105 controls, and at any particular locus, we calculate the value of our statistic, calling it S.
Now pool our cases and controls into 275 individuals, and sample 170 to be “cases” at random from the 275, calling the remainder “controls”. For this first artificial set of cases and controls, calculate the value of our statistic, S1 say.
Next, we repeat this procedure 9,999 more times, say, obtaining values S2 , S3 , S4 … S10,000 . As long as 10,000 is sufficiently many random permutations, we can get a good estimate of the p-value of our initial statistic relative to our empirically estimated null distribution, as p = #{i: Si > S }/10,000.
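The procedure just described can be sketched as follows. The function `permutation_p_value` and the toy statistic `abs_mean_diff` are hypothetical names introduced only for illustration; in the study the statistic would be the haplotype sharing statistic, which is not reproduced here.

```python
import random


def permutation_p_value(cases, controls, statistic, n_perms=10_000, seed=0):
    """Estimate p = #{i: S_i > S} / n_perms by relabelling the pooled sample."""
    rng = random.Random(seed)
    observed = statistic(cases, controls)
    pooled = list(cases) + list(controls)
    n_cases = len(cases)
    exceed = 0
    for _ in range(n_perms):
        rng.shuffle(pooled)  # random relabelling of the pooled individuals
        s_i = statistic(pooled[:n_cases], pooled[n_cases:])
        if s_i > observed:  # strict inequality, as in the definition above
            exceed += 1
    return exceed / n_perms


def abs_mean_diff(group_a, group_b):
    """Toy statistic (an assumption for this sketch): |mean(a) - mean(b)|."""
    return abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))


# Toy data in which the two groups are completely separated.
toy_cases = [10, 11, 12, 13, 14]
toy_controls = [0, 1, 2, 3, 4]
p = permutation_p_value(toy_cases, toy_controls, abs_mean_diff, n_perms=2000)
```

In this toy data no relabelling can give a larger difference in means than the observed one, so the estimated p-value is 0.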
Exercises
1. How should we decide what number of resamplings is large enough?
2. Explain in the simple case of a 2×2 table of cases and controls cross-classified as diseased and healthy, how using all possible resamplings, rather than a fixed size random sample, leads to the p-value for the exact test.
3. To avoid carrying out an unnecessarily large number of permutations, the proportion of resampled values of our statistic exceeding the value S can be monitored. Can you describe a stopping rule for the random resamplings that should lead to “accurate enough” p-values, without going to the full number each time?
Haplo_clusters - Output
    -opt 1    Genetic distances used to decide order of markers to sort on
    -c 1      The number of clusters of haplotypes to look for = 1
    -miss 1   The missing data is replaced randomly using the 2 marker haplotype information.
    -share 5  The number of haplotypes needed to share = 5
    The standard deviation p values are calculated to 0.01*phat.
    Marker names have been provided and will be used in the output files.
    # of case haplotypes = 338
    # of contol haplotypes = 208
    # of markers = 11
    # of perms = 100000
    Marker    Mapdistance  Chi_Square  p         sd(p)     -log(p)  perms
    ====================================================
    D21S1911  0            5.34        4.44e-01  4.44e-03  0.35     12510
    D21S1904  0.85         6.17        3.63e-01  3.63e-03  0.44     17577
    D21S1899  10.36        5.89        4.37e-01  4.37e-03  0.36     12876
    D21S1922  16.46        2.97        6.83e-01  6.83e-03  0.17     4636
    D21S1884  17.26        4.74        4.14e-01  4.14e-03  0.38     14135
    D21S1914  20.82        6.49        3.38e-01  3.38e-03  0.47     19571
    D21S263   28.97        4.06        5.24e-01  5.24e-03  0.28     9077
    D21S1252  39.41        1.18        8.66e-01  8.65e-03  0.06     1553
    D21S1919  42.51        1.38        8.51e-01  8.51e-03  0.07     1751
    D21S1255  43.81        2.24        7.24e-01  7.24e-03  0.14     3805
    D21S266   51.51        3.86        5.70e-01  5.70e-03  0.24     7557
    ===================================================
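The `perms` column in this output looks consistent with stopping once the standard deviation of the estimated p-value has fallen to 1% of the estimate, as the "0.01*phat" line suggests. Here is a minimal sketch of such a rule, assuming (the output does not confirm this) that sd(p) is the binomial standard error sqrt(p̂(1 − p̂)/n). The names `perms_needed` and `sequential_p_value` are hypothetical.

```python
import math
import random


def perms_needed(p, rel_sd=0.01):
    """Smallest n for which sqrt(p * (1 - p) / n) <= rel_sd * p."""
    return math.ceil((1.0 - p) / (p * rel_sd ** 2))


def sequential_p_value(exceeds, max_perms=100_000, rel_sd=0.01, min_perms=100):
    """Estimate a p-value, stopping early once sd(p_hat) <= rel_sd * p_hat.

    `exceeds` is a zero-argument callable that carries out one random
    permutation and returns True if the permuted statistic is larger than
    the observed one.
    """
    hits = 0
    for n in range(1, max_perms + 1):
        hits += bool(exceeds())
        p_hat = hits / n
        if n >= min_perms and p_hat > 0:
            sd = math.sqrt(p_hat * (1.0 - p_hat) / n)
            if sd <= rel_sd * p_hat:
                return p_hat, sd, n
    p_hat = hits / max_perms
    return p_hat, math.sqrt(p_hat * (1.0 - p_hat) / max_perms), max_perms


# Stand-in for real permutations: each draw "exceeds" with probability 0.444.
rng = random.Random(1)
p_hat, sd, n_used = sequential_p_value(lambda: rng.random() < 0.444)
```

Under this assumption a true p-value of 0.444 needs about 12,500 permutations, close to the 12510 reported for D21S1911. Small p-values need far more, which is why a cap such as the 100000 above is still required.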
Haplo_clusters - Output II
    Table of haplotypes
    Marker Cluster Haplotype Length(Haplotype)
                  D21S1911  D21S1904  D21S1899  D21S1922  D21S1884  D21S1914  D21S263   D21S125
    ===================================================
    D21S1884 1    -         -         6         3         3         8         11        -
    # of haplos:  -         -         5         82        163       22        2         -
    Chi-square:   -         -         0.2       0.0       3.2       0.1       1.2       -

    D21S1914 1    -         -         7         3         3         5         2         -
    # of haplos:  -         -         4         16        34        58        10        -
    Chi-square:   -         -         2.5       0.0       0.5       2.1       1.3       -

    D21S263 1     -         -         -         4         3         10        2         5
    # of haplos:  -         -         -         3         5         24        138       9
    Chi-square:   -         -         -         1.9       1.2       0.7       0.0       0.3

    D21S1252 1    -         -         -         -         -         -         4         5
    # of haplos:  -         -         -         -         -         -         6         83
    Chi-square:   -         -         -         -         -         -         0.6       0.1

    D21S1919 1    -         -         -         -         -         -         2         2
    # of haplos:  -         -         -         -         -         -         3         7
    Chi-square:   -         -         -         -         -         -         0.3       0.0

    Etc etc etc
    ===================================================
    Time taken (m) = 55, 23/6/2003, 11:15:12
    Haplo_cluster.pl $Revision:1.15$
Output for Chromosome 6
HLA Region – p-value <0.00001. Peak contains D6S105, MOGCA,
Empirical distributions of statistic, chr 6
Off scale
Comparison of Two Positive Controls against Two Negative Controls
Cases versus Controls
Cases versus Untransmitteds
Untransmitted versus Controls
Controls versus Untransmitted
Uniform qq-plots and multiple testing
When we carry out ~800 tests, as we have here, we expect to see many quite small p-values under the combined null hypothesis of no case-control haplotype differences anywhere, specifically, about 40 smaller than the usual 5% cutoff. In practice, we believe that at most a few of these 800 nulls will be false. How do we adjust our p-values for this multiplicity of tests?
One fairly severe way is known as the Bonferroni adjustment: to multiply all our p-values by 800. Another approach is this: rather than compare our 800 p-values to the single test 5% cutoff, we compare them all to that value which the smallest of 800 i.i.d. uniforms will exceed 95% of the time.
Exercises
1. Prove that the Bonferroni procedure is conservative, in that the family-wise type 1 error (the chance of one or more type 1 errors) under the assumption that all the null hypotheses are true, is ≤ 5%.
2. Calculate the 5th percentile of the smallest of 800 i.i.d. uniforms. How close is it to the Bonferroni 5th percentile?
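One way to check an answer to Exercise 2 numerically is sketched below. It uses the fact that the minimum of n i.i.d. uniforms has distribution function 1 − (1 − x)^n, so its α-quantile solves 1 − (1 − x)^n = α. The function names are hypothetical.

```python
import math


def bonferroni_cutoff(alpha, n_tests):
    """Per-test cutoff under the Bonferroni adjustment."""
    return alpha / n_tests


def min_uniform_cutoff(alpha, n_tests):
    """alpha-quantile of the minimum of n_tests i.i.d. U(0, 1) variables.

    Solves 1 - (1 - x)**n_tests = alpha, written with log1p and expm1 so
    that the very small answer is computed accurately.
    """
    return -math.expm1(math.log1p(-alpha) / n_tests)


bonf = bonferroni_cutoff(0.05, 800)
min_u = min_uniform_cutoff(0.05, 800)
```

With α = 0.05 and 800 tests the two cutoffs come out at roughly 6.25e-5 and 6.41e-5, so the Bonferroni value is only slightly the more severe of the two.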
Uniform qq-plots and multiple testing, cont.
The procedure just described is still conservative, for two reasons.
Firstly, the p-values are not independent, though they should be identically distributed under the null hypothesis. There are ways to incorporate this into our analysis, the most direct being to estimate the joint resampling distribution of the test statistics for every marker. This can be computationally prohibitive, especially if we also want to address the next point, which is:
Only the smallest of the p-values should be compared to the smallest of an i.i.d. or suitably dependent sequence of 800 p-values. The second smallest p-value should more correctly be compared to something slightly different, and so on. This leads us to the notion of step-wise multiple testing procedures. Resampling-based stepwise multiple testing corrections can be very computationally intensive.
In our present case we did no more than create a uniform qq-plot, and look at the number of loci “off the line” at the low end. Why? In part for computational reasons; in part, because we plan to follow up “promising” regions even if they do not have small adjusted p-values.
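The points of a uniform qq-plot can be computed as in the sketch below. It pairs the sorted observed p-values with the expected order statistics of n i.i.d. uniforms, i/(n + 1). The function name `uniform_qq_points` is hypothetical, and the plotting itself is left out.

```python
def uniform_qq_points(p_values):
    """Return (expected, observed) pairs for a uniform qq-plot.

    The i-th smallest of n i.i.d. U(0, 1) variables has expectation
    i / (n + 1), which is used here as the expected quantile.
    """
    n = len(p_values)
    observed = sorted(p_values)
    expected = [(i + 1) / (n + 1) for i in range(n)]
    return list(zip(expected, observed))


points = uniform_qq_points([0.5, 0.25, 0.75])
```

Under the combined null hypothesis the points should lie close to the line observed = expected. Loci with genuine case-control differences show up as points well below that line at the low end.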
Distribution of p-values: uniform qq plots
[Figure: uniform qq-plots of observed versus expected p-values]
Reproducibility: same datasets, different random number seeds
Similar method/problem
Similar method
Haplotype Pattern Mining (Toivonen et al, 2000). Ingileif Hallgrimsdottir (Statistics, UCB) modified and extended this method, and her (blindly derived) results were very similar to those obtained using Haplo_cluster on the MS data.
Similar problem
A study of bipolar disorder in the Central Valley of Costa Rica (Service et al, 2001, Ophoff et al, 2002) involves an admixed population of Amerindian and Spanish people, with few founders and little immigration, ~300 years old. They used likelihood methods on 3-locus haplotypes, but did not use controls.
What next for the MS study (apart from more analysis)?
• A close study of the MHC (HLA) region was conducted and published
• Relatedness of cases and controls was studied more carefully, and a few “too close” relatives identified and removed, but leaving the main conclusions unchanged
• Fine mapping around peaks was carried out: some peaks were strengthened, others disappeared. Further fine mapping under way.
• The Tasmanian cases are being joined by ethnically similar cases from the mainland, and genotyping of these new individuals in candidate regions is under way
• International collaboration is also under way
• We want to find genes and amino acid changes, if at all possible
Fine mapping: two regions
[Figure: four panels (A #3, B #3, A #4, B #4) with -log10 P on the vertical axis. Legend: B = Case vs. Control; J = Case vs. Case UT; H = Control vs. Case UT; P = Case UT vs. Control.]
Relatedness in cases and controls
We assume that our cases and controls are mostly representative of the “Tasmanian population”.
If they are too closely related (within cases or controls) we might expect bias in our sharing statistic.
If they are not closely enough related (within cases or controls) we might expect trouble detecting a signal.
This pedigree is similar to the type of pedigree found in Tasmania. The “affected” individuals are represented by the filled in symbols.
Determining the relatedness of Tasmanians based on GWS data
• Determine the level of relatedness of all pairs based on the genome wide scan data (another HMM analysis)
• We found several pairs which were much more closely related than the 10-12 meioses (6-8 generations) expected
– 10 pairs in the case data
– 6 pairs in the control data
– 2 pairs in the case and control data
• Some of these relationships were subsequently verified with further genealogical research
• We re-did the analyses without these people
Does having closely related cases or controls make a difference?
Cases versus Controls
Cases versus Controls (relateds removed)
Cases versus Untransmitteds
Cases versus Untransmitteds (relateds removed)
HLA Region & MS
• MS is believed to be an autoimmune disease (similar to type I diabetes)
• HLA association with MS previously identified
• One or more genes?
Log linear modelling with partial haplotypes suggests that two regions were responsible and that these did not interact
The HLA complex
Klein J. et al New Eng J Med, 2000; 343:702-709
An extremely gene-rich region.
[Figure: -log10 (P-value) across the HLA complex, with vertical axis from 0 to 2.5. Legend: B = 1000 perms; J = 5000 perms. An 850-Kb segment is marked.]
Microsatellite markers that spanned the HLA complex generated a peak of association in an 850-Kb segment of the class I region
We have implicated an 850-Kb class I region in MS
Genetic dissection of the HLA region by haplotype analysis
• The TNF locus + 15 other class III genes have no influence on disease - association due to strong LD with DR15
• The HLA region encodes at least two independent susceptibility loci for MS
[Figure: map of the HLA region showing MOG, F, G, A, E, C, B, TNF, DRB1, DQB1 and DPB1, with the DRB1*1501-DQB1*0602 haplotype labelled; two regions are marked √ and one is marked X]
(Rubio et al. 2002 AJHG)
An extended haplotype across HLA confers increased risk to MS
[Figure: map of HLA classes I, III and II (labels: 3.6 Mb, 5.1 Mb, ~1 cM) with markers D6S299, D6S105, D6S464, D6S2223, MOGCA, D6S2655, HLA-F, D6S510, DQB1, DRB1 and D6S291. Ancestral haplotype: 3 6 5 3 3 5 1 *1501 *0602. Relative risks shown: RR=4.3 and RR=5.7 (DR15).]
Acknowledgments
MCPHR, Hobart: Ingrid van der Mei, Trish Groom, Kristen Hazelwood, Jane Pittaway, Rhonda McCoy, Lyn Hall, Tracy Lowe, Natasha Newton, Emma Stubbs, Michele Sale, Maree Ring, Annette Banks, Joan Clough, Tim Albion, Jo Dickinson, Shelly Brown, Sue Sawbridge, Deirbhile O’Byrne, Bruce Taylor, Stan Sjeica, Andrew Hughes, Bozidar Drulovic, Terry Dwyer
WEHI
Justin Rubio, Laura Johnson, Rachel Burfoot, Stewart Huxtable, Simon Foote
ANRMSF, Canberra: Rex Simmons
MCRI
Funding: The Genes-CRC; The National Multiple Sclerosis Society (USA); Department of Neurosciences RMH; MS Australia; NH&MRC (Australia)
VTIS
The Tasmanian and Victorian public
The MS Societies of Victoria and Tasmania
Brian Tait, Mike Varney
Bob Williamson
The AGRF (Melbourne)
RMH: Niall Tubridy, Jo Baker, John Cary
Trevor Kilpatrick, Helmut Butzkueven, Mark Marriot
Melanie Bahlo, Jim Stankovich, Chris Wilkinson