1 genes and ms in tasmania, cont. lecture 6, statistics 246 february 5, 2004

Genes and MS in Tasmania, cont.

Lecture 6, Statistics 246, February 5, 2004

Nature and number of relatives needed to give accurate haplotypes

Exercise. Explain why it is that when we have both sets of parental genotypes, and the markers are reasonably polymorphic, we can reconstruct an individual’s haplotypes with high probability. What are the difficult cases?

If we have no parents, or just one parent, and grandparents’, siblings’ or offsprings’ genotypes are available, which are most informative for an individual’s haplotype reconstruction?

Simulation Study

Simulated many different types of pedigrees 300 times each to see which constellations of relatives give the best opportunity of being able to reconstruct haplotypes correctly.

Ranking the contributions in order of importance (assuming that the proband has been genotyped):

1. Parents

2. Grandparents & Siblings

3. Offspring

Genotyping

We used STR (short tandem repeat) markers, also known as microsatellite markers

…AGCTAGCGCGC….GCGCGGCATTA…

…AGCTAGCGCGC….GCGCGGCGCATTA…

Eventual plan: 5 cM genome wide scan (~ 800 markers) with dinucleotide STRs

Data collected in the Tasmanian MS Study

MS study in Tasmania: data

Collected 170 (out of an estimated 300) MS cases and 105 controls, and a constellation of ~ 4 relatives for each

Created a case/control study with 338 case haplotypes and 208 control haplotypes

Genotyping carried out at the Australian Genome Research Facility: almost 1 million genotypes (the 2nd largest genotyping project ever carried out in Australia)

Relatives of cases and controls

               Cases (170)    Controls (105)
Grandparents   29 (4%)        12 (3%)
Parents        174 (22%)      123 (30%)
Siblings       374 (46%)      215 (52%)
Spouse         67 (7%)        14 (3%)*
Offspring      168 (20%)      50 (12%)
Other          17 (1%)        0
Total          809            414*

Some issues associated with the data preparation

Errors, errors and errors

• Marker location errors: allocation to wrong chromosome, wrong order, map distances out. Maps: Généthon (Dib et al, 1996), Marshfield (Broman et al, 1998), DeCODE (Kong et al, 2001), which included a physical map

• Pedigree (relationship) errors: PREST (McPeek & Sun)

• Genotyping errors (caused by assay or analysis): ones causing Mendelian inconsistencies; ones which don’t. PEDCHECK (O’Connell), SIBMED (Douglas et al), MERLIN (Abecasis et al)

• Data handling errors (e.g. mixed up samples)

• Binning (allele labelling) errors: inconsistencies over time

Error checking: a little detail

With genome-wide genotypes, moderately close relationships can be confirmed or falsified: 7 paternity errors (6 incorrect fathers, 1 incorrect mother)

Mix-ups typically stand out (2 DNA sample swaps, 2 duplicate samples, 1 case of contaminated DNA, 1 adopted child unrelated to anyone else)

Mendelian checks picked up many genotyping errors: 1,472 inconsistencies (0.15% genotyping errors); 15 markers removed; using Mendel on the X found 3 data entry errors and 4 cases where the recorded sex was wrong

Multilocus methods can pick up more, in effect identifying close double recombinants: 58 errors inferred by this method and set to missing

Other errors demanded special methods.
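A Mendelian check of the kind mentioned above can be illustrated with a minimal sketch. It assumes a single autosomal marker, genotypes stored as pairs of allele labels, and 0 as the missing-allele code; the function name is hypothetical, and this is not the procedure implemented in PEDCHECK or any other program named in these notes.

```python
def mendel_consistent(child, father, mother):
    """Return True if the child's genotype at one autosomal marker can be
    explained by one allele from each parent.

    Genotypes are (allele, allele) tuples; 0 marks a missing allele.
    Assumption: a parent with any missing allele is treated as compatible
    with everything, and an untyped child is never flagged.
    """
    if 0 in child:
        return True

    def can_give(parent, allele):
        return 0 in parent or allele in parent

    c1, c2 = child
    return (can_give(father, c1) and can_give(mother, c2)) or (
        can_give(father, c2) and can_give(mother, c1)
    )


# Illustrative genotypes at one marker.
print(mendel_consistent((5, 8), (5, 5), (2, 8)))  # True: 5 from father, 8 from mother
print(mendel_consistent((3, 3), (5, 5), (2, 8)))  # False: neither parent carries allele 3
```

A check like this only catches errors that produce an inconsistency within a family; errors that remain Mendelian-consistent need the multilocus methods described above.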

Unforeseen Problem

Marker binning was not consistent over time. Genotyping at 796 markers took over 2 years. Heuristic approach:

• Look at all markers with allele bin differences of 1 bp

• Seek large frequency differences: χ² test, allele by box

• Carry out allele binning slippage test (for pairs of adjacent alleles and boxes): χ²

• Markers were flagged if any of the above applied, and examined for systematic trends

A founder is an individual with no parent in the sample.
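One plausible reading of the allele-by-box frequency comparison is an ordinary Pearson chi-square test on a table of allele counts per genotyping box. The sketch below assumes that reading; the function name and the counts are hypothetical and are not taken from the study.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table of
    counts (rows = genotyping boxes, columns = alleles)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip empty rows/columns
                stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical counts of two adjacent alleles (104, 106) in three boxes.
# In the last box most 104 calls appear to have slipped into 106.
counts = [
    [30, 10],
    [28, 12],
    [5, 35],
]
stat, dof = chi_square(counts)
print(stat, dof)  # compare stat with the chi-square critical value for dof
```

With 2 degrees of freedom the 5% critical value is about 5.99, so a marker whose statistic is far above that would be flagged for inspection. Restricting the table to two adjacent alleles, as here, corresponds to the pairwise slippage test.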

Example output showing partial allele slippage

Absolute frequencies for the given allele (106) in each box are shown in (time) order of genotyping

Alleles in size order

Summary information

Note slippage of allele 104 into allele 106 for Box 7 (yellow)

(Time) order of genotyping: Box 1, 2+3, 5, 6, 7, 21-23, 24-27. Numbers indicate the number of individuals in each box

Isolated bin (probably slippage)

Example of highly polymorphic marker

Box 1 - all alleles shifted +1bp

Box 1 Alleles 150, 154 shifted +1bp

Fixing allele calls

Need to track changes carefully

Obtaining haplotypes

Haplotypes were reconstructed using the Lander-Green-Kruglyak algorithm (Genehunter/Merlin/Allegro).

We’ll go into the details of the algorithm later this lecture, or in the next.

Appropriate case and control datasets with these haplotypes were then prepared. Here’s how, from Genehunter output.

Genehunter Output

• The genotype data for family MS003 (input)

MS003 301 303 302 2 2 5 8 10 11 5 7 3 4 3 5 7 9 1 8 1 6 2 3 5 6 5 5
MS003 302 0 0 2 1 2 8 9 11 0 0 7 4 0 0 0 0 0 0 1 6 0 0 0 0 5 6
MS003 303 0 0 1 1 5 5 4 10 0 0 7 3 0 0 0 0 0 0 1 1 0 0 0 0 5 8
MS003 304 303 302 2 0 2 5 4 9 5 5 7 7 1 4 4 6 4 5 1 1 3 4 4 6 6 8
MS003 305 303 302 1 0 5 8 10 11 5 7 3 4 3 5 7 9 1 8 1 6 2 3 5 6 5 5

• The haplotype reconstruction for family MS003 (output)

***** MS003 0.000
302 0 0 1       8 11 7 4 5 9 8 6 3 6 5
                2 9 5 7 4 6 5 1 4 6 6
303 0 0 1       5 10 5 3 3 7 1 1 2 5 5
                5 4 5 7 1 4 4 1 3 4 8
301 303 302 2   5 10 5 3 3 7 1 1 2 5 5
                8 11 7 4 5 9 8 6 3 6 5
304 303 302 0   5 4 5 7 1 4 4 1 3 4 8
                2 9 5 7 4 6 5 1 4 6 6
305 303 302 0   5 10 5 3 3 7 1 1 2 5 5
                8 11 7 4 5 9 8 6 3 6 5

Labels on the output: Proband; Father’s transmitted haplotype; Mother’s transmitted haplotype; Untransmitted haplotypes
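To make the layout of the input lines concrete, here is a minimal parsing sketch. It assumes the usual LINKAGE-style column order (family, individual, father, mother, sex, affection status, then two alleles per marker, with 0 for missing); the function name and the dictionary keys are hypothetical.

```python
def parse_pedigree_line(line):
    """Split one whitespace-separated pedigree line into named fields.

    Assumed column order: family, individual, father, mother, sex,
    affection status, then pairs of alleles (one pair per marker).
    """
    fields = line.split()
    alleles = [int(x) for x in fields[6:]]
    if len(alleles) % 2 != 0:
        raise ValueError("expected an even number of allele columns")
    return {
        "family": fields[0],
        "id": fields[1],
        "father": fields[2],
        "mother": fields[3],
        "sex": int(fields[4]),
        "status": int(fields[5]),
        # Unphased genotypes: one (allele, allele) pair per marker.
        "genotypes": list(zip(alleles[0::2], alleles[1::2])),
    }


line = "MS003 301 303 302 2 2 5 8 10 11 5 7 3 4 3 5 7 9 1 8 1 6 2 3 5 6 5 5"
person = parse_pedigree_line(line)
print(person["id"], len(person["genotypes"]), person["genotypes"][0])
```

Read this way, individual 301 has 11 typed markers, and the pair at each marker is unordered; it is the haplotype reconstruction that decides which allele sits on which parental chromosome.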

Extracting untransmitted haplotypes from GENEHUNTER

Three types of controls: untransmitted haplotypes (akin to controls in TDT), haplotypes from controls matched to the affecteds, random controls

To derive the untransmitted haplotypes:

use GENEHUNTER to generate the haplotypes (creates haplo.dump file)

extract the untransmitted haplotype

use reconstructed haplotypes of the parents to find the untransmitted haplotype of the affected by negation

Example:

Affected’s haplotypes    Haplotypes of parents of affected
2 5 2 12 10 6 8          2 3 0 0 1 0 2        1 2 2 1 1 7 7
1 2 3 13 12 9 7          2 5 2 12 10 6 8      1 2 3 13 12 9 8

Untransmitted haplotypes are: 2 3 0 0 1 0 2, 1 2 2 1 1 7 8
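The "negation" step can be sketched as follows: at each marker, the untransmitted allele is whichever of the parent's two alleles was not passed on. This is a minimal illustration using the example haplotypes above, not code from GENEHUNTER; the function name is hypothetical, and returning 0 (missing) when the transmitted allele matches neither parental allele is an assumption.

```python
def untransmitted(parent_hap1, parent_hap2, transmitted):
    """Marker by marker, return the parental allele that was not transmitted."""
    result = []
    for a, b, t in zip(parent_hap1, parent_hap2, transmitted):
        if t == a:
            result.append(b)
        elif t == b:
            result.append(a)
        else:
            result.append(0)  # assumption: inconsistent marker set to missing
    return result


# First parent: the second haplotype was transmitted intact.
print(untransmitted([2, 3, 0, 0, 1, 0, 2],
                    [2, 5, 2, 12, 10, 6, 8],
                    [2, 5, 2, 12, 10, 6, 8]))
# -> [2, 3, 0, 0, 1, 0, 2]

# Second parent: the transmitted haplotype is recombinant at the last marker.
print(untransmitted([1, 2, 2, 1, 1, 7, 7],
                    [1, 2, 3, 13, 12, 9, 8],
                    [1, 2, 3, 13, 12, 9, 7]))
# -> [1, 2, 2, 1, 1, 7, 8]
```

Working marker by marker is what lets the second case come out correctly: the untransmitted haplotype ends in 8 rather than 7 because the last allele of the transmitted haplotype came from the parent's other chromosome.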

Assessing haplotype sharing

Nonparametric haplotype sharing analysis

Why nonparametric, rather than likelihood-based methods?

• Likelihood methods make many assumptions regarding the genealogy of the population. We don’t know how many of these assumptions are robust to violations.

• Likelihood methods are computationally intensive, perhaps prohibitively so, especially for genome-wide scans where there is a need to maximize over the very large state space of possible ancestral haplotypes (MCMC)

• Likelihood methods have a hard time at HLA because the LD there is extremely high and non uniform (block-like structure)

• Simpler statistics will do better here unless we can model background LD

Haplotype sharing statistics for genome wide scan data cf. fine mapping

• Previous statistics (mainly likelihood based) concentrated on fine mapping and the exact localization of the variant. They assume a signal exists.

• For us, localization is not the primary interest; rather, detection is the main interest in a genome-wide scan.

Towards a sharing statistic

• Our aim was to come up with a statistic that describes haplotype sharing effectively

• At the markers closest to the disease locus the sharing statistic should be particularly large, as
– the haplotype sharing should extend the furthest there, and also
– the association of disease with particular haplotypes (alleles in the single marker case) should be strongest there

• We needed something that was not as computationally intensive as e.g. DHSMAP or BLADE

Testing for shared haplotypes

Cases
3 5 9 8 7 6 10 1 5 4 3 2 5
3 2 1 3 7 6 10 1 5 4 1 3 2
1 2 1 3 5 6 10 1 5 2 1 3 4
2 3 7 3 1 6 10 9 1 1 2 5 6

Controls
5 9 1 1 4 1 3 1 2 9 7 9 1
3 1 9 8 7 6 5 3 1 3 2 1 5
7 1 2 1 1 3 5 7 1 5 1 3 2
9 3 9 2 1 2 7 5 3 4 2 2 5

Plot: score for haplotype sharing (-log p) along the chromosome, from Pter to Qter

Sharing drop-off & allelic heterogeneity

Figure: for markers 1 to 4, the proportions of cases and of controls carrying cluster 1 haplotypes, cluster 2 haplotypes, or neither cluster 1 nor 2 haplotypes

Haplo_clusters (Melanie Bahlo)

• Calculates a sharing statistic at every marker

• Assigns significance using a permutation test

• Allows for several clusters of ancestral haplotypes (allelic heterogeneity)
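The first two points (a per-marker sharing statistic with permutation-based significance) can be sketched in a few lines. The statistic used here is a stand-in chosen for illustration, not the one implemented in Haplo_clusters: it is the mean length of the identical stretch around a marker over pairs of case haplotypes, minus the same mean over pairs of control haplotypes. All function names and the small haplotype sets are illustrative assumptions, and clustering (allelic heterogeneity) is not modelled.

```python
import random
from itertools import combinations


def shared_length(h1, h2, m):
    """Number of consecutive markers around marker m at which h1 and h2 agree."""
    if h1[m] != h2[m]:
        return 0
    left = m
    while left > 0 and h1[left - 1] == h2[left - 1]:
        left -= 1
    right = m
    while right < len(h1) - 1 and h1[right + 1] == h2[right + 1]:
        right += 1
    return right - left + 1


def mean_sharing(haps, m):
    pairs = list(combinations(haps, 2))
    return sum(shared_length(a, b, m) for a, b in pairs) / len(pairs)


def sharing_stat(cases, controls, m):
    """Stand-in statistic: excess sharing among cases relative to controls."""
    return mean_sharing(cases, m) - mean_sharing(controls, m)


def permutation_p(cases, controls, m, n_perm=1000, seed=0):
    """One-sided permutation p-value obtained by shuffling case/control labels."""
    rng = random.Random(seed)
    observed = sharing_stat(cases, controls, m)
    pooled = list(cases) + list(controls)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if sharing_stat(pooled[:len(cases)], pooled[len(cases):], m) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Small illustrative haplotype sets (13 markers each).
cases = [
    [3, 5, 9, 8, 7, 6, 10, 1, 5, 4, 3, 2, 5],
    [3, 2, 1, 3, 7, 6, 10, 1, 5, 4, 1, 3, 2],
    [1, 2, 1, 3, 5, 6, 10, 1, 5, 2, 1, 3, 4],
    [2, 3, 7, 3, 1, 6, 10, 9, 1, 1, 2, 5, 6],
]
controls = [
    [5, 9, 1, 1, 4, 1, 3, 1, 2, 9, 7, 9, 1],
    [3, 1, 9, 8, 7, 6, 5, 3, 1, 3, 2, 1, 5],
    [7, 1, 2, 1, 1, 3, 5, 7, 1, 5, 1, 3, 2],
    [9, 3, 9, 2, 1, 2, 7, 5, 3, 4, 2, 2, 5],
]

for m in range(13):
    print(m, round(sharing_stat(cases, controls, m), 2),
          permutation_p(cases, controls, m, n_perm=200))
```

Plotting -log p against marker position would give a curve of the kind used in these notes to display the sharing score along the chromosome. A permutation test is attractive here because it needs no distributional assumptions about the statistic.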
