Genes and MS in Tasmania, cont.
Lecture 6, Statistics 246, February 5, 2004
Nature and number of relatives needed to give accurate haplotypes
Exercise. Explain why it is that when we have both sets of parental genotypes, and the markers are reasonably polymorphic, we can reconstruct an individual’s haplotypes with high probability. What are the difficult cases?
If we have no parents, or just one parent, and grandparents’, siblings’ or offspring’s genotypes are available, which are most informative for an individual’s haplotype reconstruction?
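One way to explore the exercise at a single marker is to enumerate which parental origins are compatible with the observed genotypes. The sketch below is only an illustration under simple assumptions (no missing data, no genotyping error); the function name `possible_phases` is hypothetical and not part of any of the programs mentioned in the lecture.

```python
from itertools import product

def possible_phases(child, father, mother):
    """Return all (paternal_allele, maternal_allele) pairs that are
    compatible with the child's unordered genotype at one marker."""
    phases = set()
    # Try every allele the father could give with every allele the mother could give.
    for pat, mat in product(set(father), set(mother)):
        if sorted((pat, mat)) == sorted(child):
            phases.add((pat, mat))
    return phases

# Informative trio: only one assignment of parental origin is possible.
print(possible_phases((1, 2), (1, 3), (2, 2)))

# Difficult case: child and both parents share the same heterozygous genotype,
# so both assignments remain possible at this marker.
print(possible_phases((1, 2), (1, 2), (1, 2)))
```

With reasonably polymorphic markers the second situation is rare, which is one way to see why having both parents usually resolves phase.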
Simulation Study
Simulated many different types of pedigrees 300 times each to see which constellations of relatives give the best opportunity of being able to reconstruct haplotypes correctly.
Ranking the contributions in order of importance (assuming that the proband has been genotyped):
1. Parents
2. Grandparents & Siblings
3. Offspring
Genotyping
We used STR (short tandem repeat) markers, also known as microsatellite markers
…AGCTAGCGCGC….GCGCGGCATTA…
…AGCTAGCGCGC….GCGCGGCGCATTA…
Eventual plan: 5 cM genome wide scan (~ 800 markers) with dinucleotide STRs
Data collected in the Tasmanian MS Study
MS study in Tasmania: data
Collected 170 (out of an estimated 300) MS cases and 105 controls, and a constellation of ~ 4 relatives for each
Created a case/control study with 338 case haplotypes and 208 control haplotypes
Genotyping carried out at the Australian Genome Research Facility: almost 1 million genotypes (the 2nd largest genotyping project ever carried out in Australia)
Relatives of cases and controls
             Cases (170)    Controls (105)
Grdpts       29 (4%)        12 (3%)
Parents      174 (22%)      123 (30%)
Siblings     374 (46%)      215 (52%)
Spouse       67 (7%)        14 (3%)*
Offspring    168 (20%)      50 (12%)
Other        17 (1%)        0
Total        809            414*
Some issues associated with the data preparation
Errors, errors and errors
• Marker location errors: allocation to wrong chromosome, wrong order, map distances out. Généthon (Dib et al, 1996), Marshfield (Broman et al, 1998), DeCODE (Kong et al, 2001) included a physical map
• Pedigree (relationship) errors: PREST (McPeek & Sun)
• Genotyping errors (caused by assay or analysis): ones causing Mendelian inconsistencies; ones which don’t. PEDCHECK (O’Connell), SIBMED (Douglas et al), MERLIN (Abecasis et al)
• Data handling errors (e.g. mixed up samples)
• Binning (allele labelling) errors: inconsistencies over time
Error checking: a little detail
With genome-wide genotypes, moderately close relationships can be confirmed or falsified: 7 parentage errors (6 incorrect fathers, 1 incorrect mother)
Mix-ups typically stand out (2 DNA sample swaps, 2 duplicate samples, 1 case of contaminated DNA, 1 adopted child unrelated to anyone else)
Mendelian checks picked up many genotyping errors: 1,472 inconsistencies (0.15% genotyping errors); 15 markers removed; using Mendel on the X found 3 data entry errors and 4 cases where the recorded sex was wrong
Multilocus methods can pick up more, in effect identifying close double recombinants: 58 errors inferred by this method and set to missing
Other errors demanded special methods.
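The Mendelian checks mentioned above can be illustrated for a single autosomal marker and a parent-parent-child trio. This is a minimal sketch, not the procedure used by PEDCHECK or Mendel; the function name is hypothetical, and allele code 0 is assumed to mean "missing", as in the pedigree files shown later in the lecture.

```python
def mendelian_consistent(child, father, mother, missing=0):
    """Check whether a child's genotype at one autosomal marker could have
    been inherited from the given parental genotypes.
    Genotypes are pairs of allele codes; `missing` marks an untyped allele."""
    if missing in child:
        return True  # nothing to check for an untyped child

    def can_give(parent, allele):
        # An untyped parent is compatible with any allele.
        return missing in parent or allele in parent

    a, b = child
    # One allele must come from the father and the other from the mother.
    return (can_give(father, a) and can_give(mother, b)) or \
           (can_give(father, b) and can_give(mother, a))

# First marker of family MS003: proband 5/8, father 5/5, mother 2/8.
print(mendelian_consistent((5, 8), (5, 5), (2, 8)))
# A made-up inconsistency: neither parent carries allele 8.
print(mendelian_consistent((5, 8), (5, 5), (2, 9)))
```

Errors that happen to be Mendelian-consistent pass this check, which is why the multilocus methods on this slide find additional errors.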
Unforeseen Problem
Marker binning was not consistent over time. Genotyping at 796 markers took over 2 years. Heuristic approach:
• Look at all markers with allele bin differences of 1 bp
• Seek large frequency differences: χ², allele by box
• Carry out allele binning slippage test (for pairs of adjacent alleles and boxes): χ²
• Markers were flagged if any of the above, and examined for systematic trends
A founder is an individual with no parent in the sample.
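To make the frequency comparison concrete, the sketch below computes a Pearson chi-square statistic for a table of allele counts by genotyping box. It assumes the comparison on the slide is an ordinary chi-square test of homogeneity, which the slide does not spell out; the function name and the counts are invented for illustration and are not data from the study.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table given as a list
    of rows (here: one row per genotyping box, one column per allele)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts of two adjacent alleles (104, 106) in two boxes.
counts = [
    [30, 10],  # earlier box: mostly allele 104
    [10, 30],  # later box: mostly allele 106, as if 104 had slipped into 106
]
stat = chi_square_statistic(counts)
dof = (len(counts) - 1) * (len(counts[0]) - 1)
print(stat, dof)  # compare with a chi-square distribution on `dof` degrees of freedom
```

A large statistic for a pair of adjacent alleles in one box, with the rest of the marker stable, is the kind of pattern the flagged markers were then examined for.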
Example output showing partial allele slippage
Absolute frequencies for given allele (106) in each box are shown in (time) order of genotyping
Alleles in size order
Summary information
Note slippage of allele 104 into allele 106 for Box 7 (yellow)
(Time) order of genotyping: Box 1, 2+3, 5, 6, 7, 21-23, 24-27. Numbers indicate number of individuals in each box
Isolated bin (probably slippage)
Example of highly polymorphic marker
Box 1 - all alleles shifted +1bp
Box 1 Alleles 150, 154 shifted +1bp
Fixing allele calls
Need to track changes carefully
Obtaining haplotypes
Haplotypes were reconstructed using the Lander-Green-Kruglyak algorithm (Genehunter/Merlin/Allegro).
We’ll go into the details of the algorithm later this lecture, or in the next.
Appropriate case and control datasets with these haplotypes were then prepared. Here’s how, from Genehunter output.
Genehunter Output
• The genotype data for family MS003 (input)
MS003 301 303 302 2 2 5 8 10 11 5 7 3 4 3 5 7 9 1 8 1 6 2 3 5 6 5 5
MS003 302 0 0 2 1 2 8 9 11 0 0 7 4 0 0 0 0 0 0 1 6 0 0 0 0 5 6
MS003 303 0 0 1 1 5 5 4 10 0 0 7 3 0 0 0 0 0 0 1 1 0 0 0 0 5 8
MS003 304 303 302 2 0 2 5 4 9 5 5 7 7 1 4 4 6 4 5 1 1 3 4 4 6 6 8
MS003 305 303 302 1 0 5 8 10 11 5 7 3 4 3 5 7 9 1 8 1 6 2 3 5 6 5 5
• The haplotype reconstruction for family MS003 (output)
***** MS003 0.000
302 0 0 1     8 11 7 4 5 9 8 6 3 6 5
              2 9 5 7 4 6 5 1 4 6 6     (untransmitted haplotype)
303 0 0 1     5 10 5 3 3 7 1 1 2 5 5
              5 4 5 7 1 4 4 1 3 4 8     (untransmitted haplotype)
301 303 302 2 5 10 5 3 3 7 1 1 2 5 5    (proband: father’s transmitted haplotype)
              8 11 7 4 5 9 8 6 3 6 5    (proband: mother’s transmitted haplotype)
304 303 302 0 5 4 5 7 1 4 4 1 3 4 8
              2 9 5 7 4 6 5 1 4 6 6
305 303 302 0 5 10 5 3 3 7 1 1 2 5 5
              8 11 7 4 5 9 8 6 3 6 5
Extracting untransmitted haplotypes from GENEHUNTER
Three types of controls: untransmitted haplotypes (akin to controls in TDT), haplotypes from matched controls to the affecteds, random controls
To derive the untransmitted haplotypes:
• use GENEHUNTER to generate the haplotypes (creates haplo.dump file)
• extract the untransmitted haplotype
• use reconstructed haplotypes of the parents to find the untransmitted haplotype of the affected by negation
Example:
Affected’s haplotypes    Haplotypes of parents of affected
2 5 2 12 10 6 8          2 3 0 0 1 0 2      1 2 2 1 1 7 7
1 2 3 13 12 9 7          2 5 2 12 10 6 8    1 2 3 13 12 9 8
Untransmitted haplotypes are: 2 3 0 0 1 0 2, 1 2 2 1 1 7 8
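The "negation" step can be sketched marker by marker: at each marker the untransmitted allele is whichever of the parent's two alleles the affected did not receive. The code below is an illustrative sketch only; the function name is hypothetical, it assumes we already know which of the affected's haplotypes came from which parent, and it treats 0 as a missing allele code.

```python
def untransmitted(transmitted, parent_hap_a, parent_hap_b, missing=0):
    """Derive a parent's untransmitted haplotype, marker by marker, from the
    haplotype the child received and the parent's two reconstructed haplotypes."""
    result = []
    for t, a, b in zip(transmitted, parent_hap_a, parent_hap_b):
        if t == missing:
            result.append(missing)   # cannot tell which allele was passed on
        elif t == a:
            result.append(b)         # a was transmitted, so b was not
        elif t == b:
            result.append(a)         # b was transmitted, so a was not
        else:
            result.append(missing)   # inconsistent with both parental alleles
    return result

# Example from the slide: the affected's two haplotypes and each parent's pair.
first_parent = untransmitted([2, 5, 2, 12, 10, 6, 8],
                             [2, 3, 0, 0, 1, 0, 2],
                             [2, 5, 2, 12, 10, 6, 8])
second_parent = untransmitted([1, 2, 3, 13, 12, 9, 7],
                              [1, 2, 2, 1, 1, 7, 7],
                              [1, 2, 3, 13, 12, 9, 8])
print(first_parent)   # [2, 3, 0, 0, 1, 0, 2]
print(second_parent)  # [1, 2, 2, 1, 1, 7, 8]
```

Working per marker rather than per haplotype matters in the second parent: the transmitted haplotype switches between the two parental haplotypes at the last marker, so the untransmitted haplotype ends in 8 rather than 7.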
Assessing haplotype sharing
Nonparametric haplotype sharing analysis
Why nonparametric, rather than likelihood-based methods?
• Likelihood methods make many assumptions regarding the genealogy of the population. We don’t know how many of these assumptions are robust to violations.
• Likelihood methods are computationally intensive, perhaps prohibitively so, especially for genome-wide scans where there is a need to maximize over the very large state space of possible ancestral haplotypes (MCMC)
• Likelihood methods have a hard time at HLA because the LD there is extremely high and non uniform (block-like structure)
• Simpler statistics will do better here unless we can model background LD
Haplotype sharing statistics for genome wide scan data cf. fine mapping
• Previous statistics (mainly likelihood based) concentrated on fine mapping and the exact localization of the variant. They assume a signal exists.
• For us, localization is not the primary interest, rather, detection is the main interest in a genome-wide scan.
Towards a sharing statistic
• Our aim was to come up with a statistic that describes haplotype sharing effectively
• At the markers closest to the disease locus the sharing statistic should be particularly large as
– the haplotype sharing should extend the furthest, and also
– the association of disease with particular haplotypes (alleles in the single marker case) should be strongest there
• We needed something that was not as computationally intensive as e.g. DHSMAP or BLADE
Testing for shared haplotypes
Cases
3 5 9 8 7 6 10 1 5 4 3 2 5
3 2 1 3 7 6 10 1 5 4 1 3 2
1 2 1 3 5 6 10 1 5 2 1 3 4
2 3 7 3 1 6 10 9 1 1 2 5 6
Controls
5 9 1 1 4 1 3 1 2 3 1 9 8
7 6 5 3 1 3 2 1 5 9 7 9 1
7 1 2 1 1 3 5 7 1 5 1 3 2
9 3 9 2 1 2 7 5 3 4 2 2 5
[Plot: score for haplotype sharing (-log p) along the chromosome, Pter to Qter]
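One simple way to quantify sharing around a marker, for two haplotypes, is to count the contiguous run of markers through that position at which they carry identical alleles. The sketch below illustrates that idea only; it is not the statistic computed by Haplo_clusters or any other program named in the lecture, and the function name is hypothetical.

```python
def shared_run_length(h1, h2, marker):
    """Number of contiguous markers, including `marker` (0-based index),
    at which two haplotypes carry identical alleles. Returns 0 if they
    already differ at `marker`."""
    if h1[marker] != h2[marker]:
        return 0
    left = marker
    while left > 0 and h1[left - 1] == h2[left - 1]:
        left -= 1
    right = marker
    while right < len(h1) - 1 and h1[right + 1] == h2[right + 1]:
        right += 1
    return right - left + 1

# Three of the example case haplotypes from the slide.
case_a = [3, 5, 9, 8, 7, 6, 10, 1, 5, 4, 3, 2, 5]
case_b = [3, 2, 1, 3, 7, 6, 10, 1, 5, 4, 1, 3, 2]
case_c = [1, 2, 1, 3, 5, 6, 10, 1, 5, 2, 1, 3, 4]

print(shared_run_length(case_a, case_b, 6))  # 6 (shared segment 7 6 10 1 5 4)
print(shared_run_length(case_a, case_c, 6))  # 4 (shared segment 6 10 1 5)
print(shared_run_length(case_a, case_b, 1))  # 0 (alleles differ at this marker)
```

Run lengths like these are longest near the middle of the example, which is where a sharing score along the chromosome would be expected to peak.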
Sharing drop-off & allelic heterogeneity
[Figure: for each of markers 1 to 4, the proportions of cases and the proportions of controls carrying cluster 1 haplotypes, cluster 2 haplotypes, or neither cluster 1 nor 2 haplotypes]
Haplo_clusters (Melanie Bahlo)
• Calculates a sharing statistic at every marker
• Assigns significance using a permutation test
• Allows for several clusters of ancestral haplotypes (allelic heterogeneity)
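The permutation step can be sketched generically: shuffle the case/control labels many times and see how often the shuffled statistic is at least as large as the observed one. The code below is a minimal sketch under that assumption. The statistic used here (excess frequency of one allele in cases) is a stand-in chosen for brevity, not the sharing statistic of Haplo_clusters, and all function names are hypothetical.

```python
import random

def allele_frequency_excess(cases, controls, allele=10):
    """Stand-in statistic: frequency of `allele` in cases minus controls."""
    return cases.count(allele) / len(cases) - controls.count(allele) / len(controls)

def permutation_p_value(cases, controls, statistic, n_perm=1000, seed=0):
    """One-sided permutation p-value obtained by shuffling group labels."""
    rng = random.Random(seed)
    observed = statistic(cases, controls)
    pooled = list(cases) + list(controls)
    n_cases = len(cases)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if statistic(pooled[:n_cases], pooled[n_cases:]) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Alleles at the seventh marker of the example case and control haplotypes.
case_alleles = [10, 10, 10, 10]
control_alleles = [3, 2, 5, 7]
p = permutation_p_value(case_alleles, control_alleles, allele_frequency_excess)
print(p)
```

In a genome-wide scan the same shuffled labels would be reused at every marker, so that the permutation distribution can also account for testing many markers; that refinement is omitted here.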