Genes and MS in Tasmania, cont.

Lecture 6, Statistics 246, February 5, 2004

Nature and number of relatives needed to give accurate haplotypes

Exercise. Explain why it is that when we have both sets of parental genotypes, and the markers are reasonably polymorphic, we can reconstruct an individual’s haplotypes with high probability. What are the difficult cases?

If we have no parents, or just one parent, and grandparents’, siblings’ or offspring’s genotypes are available, which are most informative for an individual’s haplotype reconstruction?

Simulation Study

Simulated many different types of pedigrees, 300 times each, to see which constellations of relatives give the best chance of reconstructing haplotypes correctly.

Ranking the contributions in order of importance (assuming that the proband has been genotyped):

1. Parents

2. Grandparents & Siblings

3. Offspring

Genotyping

We used STR (short tandem repeat) markers, also known as microsatellite markers

…AGCTAGCGCGC….GCGCGGCATTA…

…AGCTAGCGCGC….GCGCGGCGCATTA…

Eventual plan: 5 cM genome wide scan (~ 800 markers) with dinucleotide STRs

Data collected in the Tasmanian MS Study

MS study in Tasmania: data

Collected 170 (out of an estimated 300) MS cases and 105 controls, and a constellation of ~ 4 relatives for each

Created a case/control study with 338 case haplotypes and 208 control haplotypes

Genotyping carried out at the Australian Genome Research Facility: almost 1 million genotypes (the 2nd largest genotyping project ever carried out in Australia)

Relatives of cases and controls

               Cases (170)    Controls (105)
Grandparents   29 (4%)        12 (3%)
Parents        174 (22%)      123 (30%)
Siblings       374 (46%)      215 (52%)
Spouse         67 (7%)        14 (3%)*
Offspring      168 (20%)      50 (12%)
Other          17 (1%)        0
Total          809            414*

Some issues associated with the data preparation

Errors, errors and errors

• Marker location errors: allocation to wrong chromosome, wrong order, map distances out. Maps: Généthon (Dib et al, 1996), Marshfield (Broman et al, 1998), DeCODE (Kong et al, 2001; included a physical map)

• Pedigree (relationship) errors: PREST (McPeek & Sun)

• Genotyping errors (caused by assay or analysis): ones causing Mendelian inconsistencies; ones which don’t. PEDCHECK (O’Connell), SIBMED (Douglas et al), MERLIN (Abecasis et al)

• Data handling errors (e.g. mixed up samples)

• Binning (allele labelling) errors: inconsistencies over time

Error checking: a little detail

With genome-wide genotypes, moderately close relationships can be confirmed or falsified: 7 paternity errors (6 incorrect fathers, 1 incorrect mother)

Mix-ups typically stand out (2 DNA sample swaps, 2 duplicate samples, 1 case of contaminated DNA, 1 adopted child unrelated to anyone else)

Mendelian checks picked up many genotyping errors: 1,472 inconsistencies (0.15% genotyping errors); 15 markers removed; using Mendel on the X found 3 data entry errors and 4 cases where the recorded sex was wrong

Multilocus methods can pick up more, in effect identifying close double recombinants: 58 errors inferred by this method and put to missing

Other errors demanded special methods.
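A Mendelian check of the kind mentioned above can be sketched for a single autosomal marker in a parent-child trio. This is only an illustration: the function name, the use of 0 for a missing allele and the example genotypes are assumptions, and programs such as PEDCHECK do considerably more (whole pedigrees, X-linked markers, error localisation).

```python
def mendel_consistent(child, father, mother):
    """Return True if the child's genotype at one autosomal marker can be
    explained by one allele from each parent.

    Genotypes are pairs of allele labels. 0 is assumed to mean "missing",
    and any trio with a missing allele is treated as uninformative.
    """
    if 0 in child or 0 in father or 0 in mother:
        return True
    a, b = child
    # One allele must come from the father and the other from the mother.
    return (a in father and b in mother) or (b in father and a in mother)


# Hypothetical trios at a single marker.
print(mendel_consistent((5, 8), (5, 5), (2, 8)))  # consistent
print(mendel_consistent((5, 8), (5, 5), (2, 9)))  # inconsistent: no parent carries 8
```

Errors that happen to produce a Mendelian-consistent genotype pass this test, which is why the multilocus methods on the same slide pick up additional errors.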

Unforeseen Problem

Marker binning was not consistent over time. Genotyping at 796 markers took over 2 years. Heuristic approach:

• Look at all markers with allele bin differences of 1 bp

• Seek large frequency differences: χ² allele by box

• Carry out allele binning slippage test (for pairs of adjacent alleles and boxes): χ²

• Markers were flagged if any of the above, and examined for systematic trends

A founder is an individual with no parent in the sample.
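One plausible reading of the allele-by-box frequency check is a Pearson chi-square statistic on a table of allele counts per genotyping box. The sketch below computes the statistic and its degrees of freedom; the function name and the counts are hypothetical, and it assumes every allele and every box has at least one observation.

```python
def chi_square_allele_by_box(counts):
    """Pearson chi-square statistic for an allele-by-box count table.

    counts[i][j] is the number of copies of allele i seen in box j.
    Returns (statistic, degrees_of_freedom).
    """
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(counts):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


# Hypothetical counts: two adjacent alleles (rows) in three boxes (columns).
# The third box looks as if one allele has slipped into its neighbour's bin.
table = [[30, 28, 5],
         [10, 12, 35]]
stat, df = chi_square_allele_by_box(table)
print(stat, df)
```

A statistic that is large relative to a chi-square distribution with `df` degrees of freedom would flag the marker for inspection. Restricting the table to two adjacent alleles, as here, corresponds to the pairwise slippage test.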

Example output showing partial allele slippage

Absolute frequencies for the given allele (106) in each box are shown in (time) order of genotyping

Alleles in size order

Summary information

Note slippage of allele 104 into allele 106 for Box 7 (yellow)

(Time) order of genotyping: Box 1, 2+3, 5, 6, 7, 21-23, 24-27. Numbers indicate number of individuals in each box

Isolated bin (probably slippage)

Example of highly polymorphic marker

Box 1 - all alleles shifted +1bp

Box 1 Alleles 150, 154 shifted +1bp

Fixing allele calls

Need to track changes carefully

Obtaining haplotypes

Haplotypes were reconstructed using the Lander-Green-Kruglyak algorithm (Genehunter/Merlin/Allegro).

We’ll go into the details of the algorithm later this lecture, or in the next.

Appropriate case and control datasets with these haplotypes were then prepared. Here’s how, from Genehunter output.

Genehunter Output

• The genotype data for family MS003 (input)

MS003 301 303 302 2 2 5 8 10 11 5 7 3 4 3 5 7 9 1 8 1 6 2 3 5 6 5 5
MS003 302 0 0 2 1 2 8 9 11 0 0 7 4 0 0 0 0 0 0 1 6 0 0 0 0 5 6
MS003 303 0 0 1 1 5 5 4 10 0 0 7 3 0 0 0 0 0 0 1 1 0 0 0 0 5 8
MS003 304 303 302 2 0 2 5 4 9 5 5 7 7 1 4 4 6 4 5 1 1 3 4 4 6 6 8
MS003 305 303 302 1 0 5 8 10 11 5 7 3 4 3 5 7 9 1 8 1 6 2 3 5 6 5 5

• The haplotype reconstruction for family MS003 (output)

***** MS003 0.000
302 0 0 1       8 11 7 4 5 9 8 6 3 6 5
                2 9 5 7 4 6 5 1 4 6 6      (untransmitted haplotype)
303 0 0 1       5 10 5 3 3 7 1 1 2 5 5
                5 4 5 7 1 4 4 1 3 4 8      (untransmitted haplotype)
301 303 302 2   5 10 5 3 3 7 1 1 2 5 5     (proband: father’s transmitted haplotype)
                8 11 7 4 5 9 8 6 3 6 5     (proband: mother’s transmitted haplotype)
304 303 302 0   5 4 5 7 1 4 4 1 3 4 8
                2 9 5 7 4 6 5 1 4 6 6
305 303 302 0   5 10 5 3 3 7 1 1 2 5 5
                8 11 7 4 5 9 8 6 3 6 5
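The input lines for family MS003 appear to follow the usual LINKAGE-style pedigree layout: family, individual, father, mother, sex, affection status, then two alleles per marker, with 0 for a missing value. A minimal parser under that assumption might look as follows; the function name and the dictionary keys are hypothetical.

```python
def parse_pedigree_line(line):
    """Parse one whitespace-separated pedigree line.

    Assumed columns: family, individual, father, mother, sex, affection,
    followed by two alleles per marker (0 = missing).
    """
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("expected at least six columns")
    alleles = [int(x) for x in fields[6:]]
    if len(alleles) % 2 != 0:
        raise ValueError("odd number of alleles; expected two per marker")
    return {
        "family": fields[0],
        "id": fields[1],
        "father": fields[2],
        "mother": fields[3],
        "sex": int(fields[4]),
        "affection": int(fields[5]),
        # Pair up consecutive alleles into one genotype per marker.
        "genotypes": list(zip(alleles[0::2], alleles[1::2])),
    }


proband = parse_pedigree_line(
    "MS003 301 303 302 2 2 5 8 10 11 5 7 3 4 3 5 7 9 1 8 1 6 2 3 5 6 5 5"
)
print(proband["id"], len(proband["genotypes"]), proband["genotypes"][:3])
```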

Extracting untransmitted haplotypes from GENEHUNTER

Three types of controls: untransmitted haplotypes (akin to controls in TDT), haplotypes from matched controls to the affecteds, random controls

To derive the untransmitted haplotypes:

use GENEHUNTER to generate the haplotypes (creates haplo.dump file)

extract the untransmitted haplotype

use reconstructed haplotypes of the parents to find the untransmitted haplotype of the affected by negation

Example:

Affected’s haplotypes      Haplotypes of parents of affected

2 5 2 12 10 6 8            2 3 0 0 1 0 2          1 2 2 1 1 7 7
1 2 3 13 12 9 7            2 5 2 12 10 6 8        1 2 3 13 12 9 8

Untransmitted haplotypes are: 2 3 0 0 1 0 2, 1 2 2 1 1 7 8
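The "negation" step can be read as a per-marker complement: at each marker the untransmitted allele is whichever of the parent's two alleles was not passed on. The sketch below follows that reading and reproduces the example, including the last marker of the second parent, where the transmitted haplotype switches between the two parental haplotypes. The function name is hypothetical, and 0 (missing) is treated like any other allele label.

```python
def untransmitted(parent_hap_a, parent_hap_b, transmitted):
    """Untransmitted haplotype of one parent, marker by marker.

    parent_hap_a, parent_hap_b: the parent's two reconstructed haplotypes.
    transmitted: the haplotype the affected child received from this parent.
    """
    if not (len(parent_hap_a) == len(parent_hap_b) == len(transmitted)):
        raise ValueError("haplotypes must have the same number of markers")
    result = []
    for a, b, t in zip(parent_hap_a, parent_hap_b, transmitted):
        if t == a:
            result.append(b)   # a was transmitted, so b was not
        elif t == b:
            result.append(a)   # b was transmitted, so a was not
        else:
            raise ValueError("transmitted allele not found in parent")
    return result


# Values from the example above.
print(untransmitted([2, 3, 0, 0, 1, 0, 2],
                    [2, 5, 2, 12, 10, 6, 8],
                    [2, 5, 2, 12, 10, 6, 8]))
print(untransmitted([1, 2, 2, 1, 1, 7, 7],
                    [1, 2, 3, 13, 12, 9, 8],
                    [1, 2, 3, 13, 12, 9, 7]))
```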

Assessing haplotype sharing

Nonparametric haplotype sharing analysis

Why nonparametric, rather than likelihood-based methods?

• Likelihood methods make many assumptions regarding the genealogy of the population. We don’t know how many of these assumptions are robust to violations.

• Likelihood methods are computationally intensive, perhaps prohibitively so, especially for genome-wide scans where there is a need to maximize over the very large state space of possible ancestral haplotypes (MCMC)

• Likelihood methods have a hard time at HLA because the LD there is extremely high and non-uniform (block-like structure)

• Simpler statistics will do better here unless we can model background LD

Haplotype sharing statistics for genome wide scan data cf. fine mapping

• Previous statistics (mainly likelihood based) concentrated on fine mapping and the exact localization of the variant. They assume a signal exists.

• For us, localization is not the primary interest; rather, detection is the main interest in a genome-wide scan.

Towards a sharing statistic

• Our aim was to come up with a statistic that describes haplotype sharing effectively

• At the markers closest to the disease locus the sharing statistic should be particularly large, as
  – the haplotype sharing should extend the furthest, and also
  – the association of disease with particular haplotypes (alleles in the single marker case) should be strongest there

• We needed something that was not as computationally intensive as e.g. DHSMAP or BLADE

Testing for shared haplotypes

Cases
 3  5  9  8  7  6 10  1  5  4  3  2  5
 3  2  1  3  7  6 10  1  5  4  1  3  2
 1  2  1  3  5  6 10  1  5  2  1  3  4
 2  3  7  3  1  6 10  9  1  1  2  5  6

Controls
 3  1  9  8  7  6  5  3  1  3  2  1  5
 5  9  1  1  4  1  3  1  2  9  7  9  1
 7  1  2  1  1  3  5  7  1  5  1  3  2
 9  3  9  2  1  2  7  5  3  4  2  2  5

[Plot: score for haplotype sharing (- log p) along the chromosome, from Pter to Qter]

Sharing drop-off & allelic heterogeneity

[Figure: for markers 1 to 4, the proportions of cases and the proportions of controls carrying cluster 1 haplotypes, cluster 2 haplotypes, or neither cluster 1 nor 2 haplotypes]

Haplo_clusters (Melanie Bahlo)

• Calculates a sharing statistic at every marker

• Assigns significance using a permutation test

• Allows for several clusters of ancestral haplotypes (allelic heterogeneity)
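To make the first two bullets concrete, here is a minimal sketch of a per-marker sharing statistic with a permutation test. It is not the Haplo_clusters statistic, whose definition is not given in these slides. As a stand-in, the statistic is the mean pairwise length of the identical run around the marker among case haplotypes minus the same quantity among controls. All function names are hypothetical, there is no clustering step, and alleles (including any 0) are compared as plain labels. The haplotypes are those shown on the "Testing for shared haplotypes" slide.

```python
import random
from itertools import combinations


def shared_length(h1, h2, m):
    """Number of consecutive markers around m at which h1 and h2 agree."""
    if h1[m] != h2[m]:
        return 0
    left = m
    while left > 0 and h1[left - 1] == h2[left - 1]:
        left -= 1
    right = m
    while right < len(h1) - 1 and h1[right + 1] == h2[right + 1]:
        right += 1
    return right - left + 1


def mean_sharing(haps, m):
    """Average shared length at marker m over all pairs (needs >= 2 haplotypes)."""
    pairs = list(combinations(haps, 2))
    return sum(shared_length(a, b, m) for a, b in pairs) / len(pairs)


def sharing_stat(cases, controls, m):
    """Excess sharing among cases relative to controls at marker m."""
    return mean_sharing(cases, m) - mean_sharing(controls, m)


def permutation_p(cases, controls, m, n_perm=1000, seed=0):
    """One-sided p-value obtained by shuffling case/control labels."""
    rng = random.Random(seed)
    observed = sharing_stat(cases, controls, m)
    pooled = list(cases) + list(controls)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm = sharing_stat(pooled[:len(cases)], pooled[len(cases):], m)
        if perm >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


cases = [[3, 5, 9, 8, 7, 6, 10, 1, 5, 4, 3, 2, 5],
         [3, 2, 1, 3, 7, 6, 10, 1, 5, 4, 1, 3, 2],
         [1, 2, 1, 3, 5, 6, 10, 1, 5, 2, 1, 3, 4],
         [2, 3, 7, 3, 1, 6, 10, 9, 1, 1, 2, 5, 6]]
controls = [[3, 1, 9, 8, 7, 6, 5, 3, 1, 3, 2, 1, 5],
            [5, 9, 1, 1, 4, 1, 3, 1, 2, 9, 7, 9, 1],
            [7, 1, 2, 1, 1, 3, 5, 7, 1, 5, 1, 3, 2],
            [9, 3, 9, 2, 1, 2, 7, 5, 3, 4, 2, 2, 5]]

m = 6  # zero-based index of the seventh marker, where all cases carry allele 10
print(sharing_stat(cases, controls, m))
print(permutation_p(cases, controls, m))
```

Repeating the last two calls for every marker index gives a profile along the chromosome. With only eight haplotypes there are few distinct relabelings, so the permutation p-value is coarse here.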