Cryptic Variation in the Human Mutation Rate
Alan Hodgkinson
Adam Eyre-Walker, Manolis Ladoukakis
Variation in the mutation rate:
• Between different chromosomes
• Between regions on chromosomes
• Neighbouring nucleotides
Simple context effects:
Hwang and Green (2004) PNAS 101: 13994-14001
Cryptic Variation:
• Remote context:
AGTCGGTTACCGTGACGTTGAACGTGT
• Degenerate context:
AGTCGGTTACCGTGYSRGYGAACGTGT
• No context / Complex context
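A degenerate context such as the one written above with IUPAC codes can be searched for mechanically. The sketch below is an illustration only: the helper names `motif_to_regex` and `find_motif` are hypothetical, and only a subset of the IUPAC codes is included.

```python
import re

# Subset of the standard IUPAC nucleotide codes (single bases, two-base codes, N).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}

def motif_to_regex(motif):
    """Translate a degenerate IUPAC motif into a regular expression string."""
    return "".join(IUPAC[base] for base in motif.upper())

def find_motif(motif, sequence):
    """Return 0-based start positions of every (possibly overlapping) match."""
    # A lookahead keeps matches zero-width so overlapping hits are all reported.
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]

print(find_motif("YSRGYG", "AACGAGTGAA"))
```

Here `CGAGTG` satisfies `YSRGYG` (C is a Y, G is an S, A is an R, T is a Y), so the only hit starts at position 2.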
Our approach to the problem
• Search for SNPs in human sequences that also have a SNP in the orthologous position in chimp.
Human
Chimp
Do we see more coincident SNPs than expected by chance?
The method
• Extract all human SNPs from dbSNP and construct a BLAST database on a chromosome-by-chromosome basis.
• Extract all chimp SNPs from dbSNP with 50bp either side of the SNP.
• BLAST chimp SNPs against the human database.
• Extract results above a certain level of homology where there is a SNP on both sequences, and reduce to 40bp either side of the central position.
• Repeat the analysis both including and excluding CpG effects.
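The trimming and classification step can be sketched as follows. This is a minimal illustration, assuming a gap-free 101bp alignment centred on the chimp SNP and human SNP positions already mapped into alignment coordinates; the function names are hypothetical and the BLAST step itself is not shown.

```python
FLANK_IN = 50   # flank extracted around each chimp SNP (from the slides)
FLANK_OUT = 40  # flank kept after trimming (from the slides)

def trim_window(aligned_seq, flank_in=FLANK_IN, flank_out=FLANK_OUT):
    """Cut a (2*flank_in+1)-bp alignment down to 2*flank_out+1 bp around its centre."""
    if len(aligned_seq) != 2 * flank_in + 1:
        raise ValueError("expected a gap-free alignment centred on the chimp SNP")
    offset = flank_in - flank_out
    return aligned_seq[offset:len(aligned_seq) - offset]

def classify_alignment(human_snp_positions, flank_in=FLANK_IN, flank_out=FLANK_OUT):
    """Classify one alignment by where its human SNPs fall.

    human_snp_positions: 0-based positions of human SNPs in the untrimmed alignment.
    Returns None if no human SNP survives trimming, otherwise "coincident"
    (a human SNP sits on the central chimp SNP) or "non-coincident".
    """
    centre = flank_in
    kept = [p for p in human_snp_positions if abs(p - centre) <= flank_out]
    if not kept:
        return None
    return "coincident" if centre in kept else "non-coincident"

print(len(trim_window("A" * 101)))   # length of the trimmed window
print(classify_alignment([50]))      # human SNP on the central site
print(classify_alignment([12]))      # human SNP elsewhere in the window
```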
Results
• ~1.5 million chimp SNPs.
• ~310,000 81bp alignments containing a human and chimp SNP.
• Observe the number of coincident SNPs.
• Calculate the expected number, taking into account the effects of neighbouring nucleotides.
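One way to build such an expectation is to weight each site in a window by a context-dependent relative mutation rate and ask how often a single human SNP would land on the central site. The sketch below is an assumption about how this could be done, not the calculation used in the talk; the `rates` table is a placeholder, not a set of published estimates.

```python
def p_coincident(window, rates, default=1.0):
    """Probability that one human SNP in `window` falls on the central site,
    if sites mutate in proportion to the relative rate of their triplet context."""
    centre = len(window) // 2
    weights = []
    for i in range(len(window)):
        # End positions have no complete triplet and fall back to the default rate.
        has_triplet = 0 < i < len(window) - 1
        triplet = window[i - 1:i + 2] if has_triplet else None
        weights.append(rates.get(triplet, default))
    return weights[centre] / sum(weights)

def expected_coincident(windows, rates):
    """Expected number of coincident SNPs summed over all windows."""
    return sum(p_coincident(w, rates) for w in windows)

# With no context effects every site in an 81bp window is equally likely.
uniform = p_coincident("A" * 81, {})

# Hypothetical tenfold elevation for a C in an ACG context at the centre.
cpg_window = "A" * 39 + "ACG" + "A" * 39
elevated = p_coincident(cpg_window, {"ACG": 10.0})
print(uniform, elevated)
```

With uniform rates the probability is 1/81 per window; raising the rate at the central site raises the expected number of chance coincidences, which is why the expectation has to account for context.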
Results
        Obs    Exp   Ratio
All     11571  6592  1.76 (1.72,1.79)
No-CpG  5028   2533  1.98 (1.93,2.04)
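The ratios are observed divided by expected. The slides do not say how the intervals were obtained; the sketch below assumes the observed count is Poisson and uses a normal approximation, which is an assumption rather than the stated method.

```python
import math

def ratio_with_ci(observed, expected, z=1.96):
    """Observed/expected ratio with an approximate interval,
    treating the observed count as Poisson (variance equal to the count)."""
    ratio = observed / expected
    half_width = z * math.sqrt(observed) / expected
    return ratio, ratio - half_width, ratio + half_width

for label, obs, exp in [("All", 11571, 6592), ("No-CpG", 5028, 2533)]:
    ratio, low, high = ratio_with_ci(obs, exp)
    print(f"{label}: {ratio:.2f} ({low:.2f},{high:.2f})")
```

Under this assumption the function gives 1.76 (1.72,1.79) and 1.98 (1.93,2.04) for the two rows of counts.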
Results
C/T G/A C/A G/T C/G A/T
C/T 1.91 1.04 1.19 1.21 0.96
G/A 1.83 1.24 1.02 1.14 1.40
C/A 1.23 1.08 4.81 1.28 1.39
G/T 1.15 1.38 4.95 1.27 0.77
C/G 1.09 1.14 1.24 1.40 2.79
A/T 0.94 1.06 1.79 0.99 15.43
Alternative Explanations
• Bias in the Method
• Selection
• Ancestral Polymorphism
• Paralogous SNPs
Methodological Bias
• Simulated data with same density of human and chimp SNPs as dbSNP under different divergence and mutation patterns.
• Method worked well under realistic conditions.
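The logic of such a check can be illustrated with a much simpler simulation than the one described: place one human SNP uniformly at random in each 81bp window and confirm that the observed/expected ratio stays near 1. This sketch ignores divergence and mutation patterns, and the function name is hypothetical.

```python
import random

def simulate_ratio(n_windows, window=81, seed=0):
    """Observed/expected coincident SNPs when human SNPs fall uniformly at random."""
    rng = random.Random(seed)
    centre = window // 2
    # A window is coincident when the random human SNP hits the central chimp SNP.
    observed = sum(rng.randrange(window) == centre for _ in range(n_windows))
    expected = n_windows / window
    return observed / expected

print(simulate_ratio(200_000))
```

With no variation in the mutation rate, the ratio should differ from 1 only by sampling noise.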
Methodological Bias
All sites (H&G):
Div  Obs   Exp   Ratio  95% CI
0    839   812   1.033  (0.963,1.103)
1    2419  2316  1.040  (1.003,1.086)
2    681   685   0.995  (0.920,1.069)
Non CpG sites (H&G):
Div  Obs   Exp   Ratio  95% CI
0    401   428   0.936  (0.844,1.028)
1    1182  1228  0.963  (0.908,1.018)
2    374   400   0.935  (0.840,1.030)
Selection
• Areas of low SNP density result in clustering:
Human
Chimp
Apparent excess of coincident SNPs
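The size of this effect can be illustrated with a small calculation. The sketch assumes that each region has the same per-site SNP density in both species and that SNPs fall independently within a region; the chance of a coincidence at a site is then the density squared, so variation in density between regions inflates the total. The function name is hypothetical.

```python
def apparent_excess(densities):
    """Ratio of coincident SNPs expected when density varies between regions
    to the number expected if the mean density applied everywhere."""
    n = len(densities)
    mean = sum(densities) / n
    mean_sq = sum(d * d for d in densities) / n
    return mean_sq / (mean * mean)

print(apparent_excess([0.01, 0.01, 0.01, 0.01]))  # uniform density
print(apparent_excess([0.0, 0.02]))               # half the regions carry no SNPs
```

If half of the regions are constrained and carry no SNPs while the rest carry twice the average density, twice as many coincident SNPs appear as a genome-wide average would predict, with no variation in the mutation rate at all.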
Selection
• No clustering:
Ancestral Polymorphism
• SNP inherited from common ancestor of chimp and human:
[Figure: a T/A polymorphism in the common ancestor passed down to both human and chimp]
Increase in coincident SNPs
Ancestral Polymorphism
• Expect the observed/expected ratio to be the same for all transitions:
C/T G/A C/A G/T C/G A/T
C/T 1.91 1.04 1.19 1.21 0.96
G/A 1.83 1.24 1.02 1.14 1.40
C/A 1.23 1.08 4.81 1.28 1.39
G/T 1.15 1.38 4.95 1.27 0.77
C/G 1.09 1.14 1.24 1.40 2.79
A/T 0.94 1.06 1.79 0.99 15.43
Ancestral Polymorphism
• Repeated initial analysis with macaque data.
• Humans and macaques split ~23-24 million years ago, so we expect no shared polymorphisms.
        Obs  Exp  Ratio
All     77   47   1.64 (1.27,2.00)
No-CpG  34   23   1.51 (1.001,2.02)
Paralogous SNPs
• The excess of coincident SNPs could be a consequence of artifactual SNPs called as a result of substitutions between paralogous regions.
• Musumeci et al (2010): 8.32% of human variation in dbSNP may be due to paralogy.
AGCTGCACGT Y CGGCATCCAA  SNP
AGCTGCACGT T CGGCATCCAA  Chromosome 1
AGCTGCACGT A CGGCATCCAA  Chromosome 7
Artifactual SNP
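The example above can be reproduced in a few lines: if reads from two paralogous copies are pooled as though they came from one locus, any fixed difference between the copies looks like a SNP. This is an illustrative sketch with a hypothetical helper name, not a SNP-calling procedure.

```python
def paralog_differences(copy_a, copy_b):
    """List (position, base_a, base_b) for every site where two aligned
    paralogous copies differ; each would appear as a SNP if the copies were pooled."""
    if len(copy_a) != len(copy_b):
        raise ValueError("copies must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(copy_a, copy_b)) if a != b]

chr1_copy = "AGCTGCACGT" + "T" + "CGGCATCCAA"
chr7_copy = "AGCTGCACGT" + "A" + "CGGCATCCAA"
print(paralog_differences(chr1_copy, chr7_copy))  # [(10, 'T', 'A')]
```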
Paralogous SNPs
AGCTGCACGT (T/A) CGGCATCCAA
AGCTGCACGT T CGGCATCCAA

AGCTGCACGT (T/A) CGGCATCCAA
AGCTGCACGT T CGGCATCCAA
AGCTGCACGT A CGGCATCCAA
3.6% of coincident SNPs are possibly a consequence of paralogous sequences
Alternative Explanations
• Bias in the method
• Selection
• Ancestral Polymorphism
• Paralogous SNPs
Cryptic variation in the mutation rate
Context Analysis
• 4517 sequences containing non-CpG coincident SNPs flanked by 200bp.
• Tabulate triplet frequencies at each position in surrounding sequences.
• Test whether the proportions of triplets observed at each position are significantly different from the proportions in the sequences as a whole.
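The tabulation and test can be sketched as below. This is a minimal illustration assuming equal-length sequences and a Pearson chi-square statistic per position; the slides do not name the test used, and the function names are hypothetical.

```python
from collections import Counter

def triplet_counts_by_position(sequences):
    """One Counter of triplets per start position (sequences must share a length)."""
    length = len(sequences[0])
    tables = [Counter() for _ in range(length - 2)]
    for seq in sequences:
        for i in range(length - 2):
            tables[i][seq[i:i + 3]] += 1
    return tables

def chi_square_vs_background(position_counts, background_counts):
    """Pearson chi-square of one position's triplet counts against
    the proportions seen across all positions."""
    n = sum(position_counts.values())
    total = sum(background_counts.values())
    stat = 0.0
    for triplet, bg in background_counts.items():
        expected = n * bg / total
        stat += (position_counts.get(triplet, 0) - expected) ** 2 / expected
    return stat

def scan_positions(sequences):
    """Chi-square statistic for every triplet start position."""
    tables = triplet_counts_by_position(sequences)
    background = sum(tables, Counter())
    return [chi_square_vs_background(t, background) for t in tables]

print(scan_positions(["AAAC", "AAAG"]))
```

Each statistic would then be compared with a chi-square distribution (degrees of freedom equal to the number of triplet classes minus one), with a correction for testing many positions.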
Context Analysis
• Coincident SNP in central position:
No obvious context surrounding coincident SNPs
Genomic Distribution
• Tallied the number of coincident SNPs per MB:
- 3.91 coincident SNPs per MB.
- 1.68 non-CpG coincident SNPs per MB.
• If randomly distributed, expect a Poisson distribution with mean = σ² = 3.91.
• Observed σ² = 13.27 (p<0.001), so sampling variance explains approximately 30% of the total variance.
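The 30% figure follows from comparing the variance a Poisson process would produce (equal to the mean, 3.91) with the variance of 13.27. The sketch below shows that arithmetic plus a dispersion index for raw per-Mb counts; using the sample variance is an assumption, and the function names are hypothetical.

```python
from statistics import mean, variance

def poisson_share(mean_count, observed_variance):
    """Fraction of the observed variance explained by Poisson sampling alone."""
    return mean_count / observed_variance

def dispersion_index(counts):
    """Variance-to-mean ratio of per-Mb counts; about 1 for a Poisson process."""
    return variance(counts) / mean(counts)

print(round(poisson_share(3.91, 13.27), 2))  # 0.29
```

A dispersion index well above 1 means the counts are over-dispersed relative to a random scatter of coincident SNPs along the genome.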
Genomic Distribution
Feature r r2 pSNP density 0.256 0.0655 <0.001**
Distance to Telomere
-0.022 0.0004 0.226
Distance to Centromere
0.011 0.0001 0.565
Recombination Rate
0.107 0.0114 <0.001**
Nucleosome Association
0.004 0.0000 0.832
Gene Density -0.022 0.0004 0.230
GC content -0.006 0.0000 0.741
Genomic Distribution
• SNP densities must drive coincident SNP densities to a certain extent as approximately half of coincident SNPs are created by chance alone.
• Recombination rate positively correlated with SNP density (r = 0.242, p<0.001).
• Partial correlation controlling for SNP density: r = 0.048, p=0.011**.
• SNP density explains 6.5% of the variance in coincident SNP density; recombination rate explains 0.2%.
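The partial correlation can be checked against the pairwise correlations reported on the slides, assuming the standard first-order formula was used (the slides do not say). The inputs are coincident SNP density with recombination rate (0.107), coincident SNP density with SNP density (0.256), and recombination rate with SNP density (0.242).

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# x: coincident SNP density, y: recombination rate, z: SNP density
r = partial_correlation(r_xy=0.107, r_xz=0.256, r_yz=0.242)
print(round(r, 3), round(r * r, 3))  # 0.048 0.002
```

Squaring the partial correlation gives about 0.002, the 0.2% of variance attributed to recombination rate.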
Genomic Distribution
Feature                  r       r²      p
Coincident SNP Density   0.256   0.0655  <0.001**
Distance to Telomere    -0.171   0.0292  <0.001**
Distance to Centromere  -0.047   0.0022  0.012**
Recombination Rate       0.234   0.0548  <0.001**
Nucleosome Association   0.187   0.0350  <0.001**
Gene Density             0.064   0.0041  0.001**
GC content               0.184   0.0339  <0.001**
Quantification
• Use Log-normal distribution of relative mutation rates due to cryptic variation.
• Model the number of coincident SNPs under the effects of cryptic variation.
• Incorporate effects of divergence.
What level of variation in the log-normal distribution explains our results?
Log-normal model
Fastest 5% of sites mutate ~16.4 times faster than slowest 5% of sites.
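To see how such a fold difference relates to the spread of a log-normal distribution, the sketch below converts between the standard deviation of the log rate and the ratio of the 95th to the 5th percentile. Reading "fastest 5% versus slowest 5%" as a ratio of quantiles is an assumption (it could instead be a ratio of tail means), so the resulting sigma is illustrative and not the estimate from the talk.

```python
import math
from statistics import NormalDist

def quantile_ratio(sigma, tail=0.05):
    """Ratio of the (1 - tail) quantile to the tail quantile of a log-normal
    distribution whose log has standard deviation sigma."""
    z = NormalDist().inv_cdf(1 - tail)
    return math.exp(2 * z * sigma)

def sigma_for_ratio(ratio, tail=0.05):
    """Standard deviation of the log rate that yields the given quantile ratio."""
    z = NormalDist().inv_cdf(1 - tail)
    return math.log(ratio) / (2 * z)

sigma = sigma_for_ratio(16.4)
print(round(sigma, 2))
```

Under the quantile reading, a 16.4-fold difference corresponds to a log-scale standard deviation of roughly 0.85.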
Summary
• Cryptic variation in the mutation rate.
• No obvious context surrounding coincident SNPs.
• Variation is truly cryptic.
• Genomic distribution of coincident SNPs is over-dispersed.
• Variation in mutation rate is substantial.
Acknowledgments
• People: Adam Eyre-Walker, Manolis Ladoukakis
• BBSRC