SNP Resources: Variation Discovery, HapMap and the EGP
Mark J. Rieder
Department of Genome Sciences
[email protected]@u.washington.edu
NIEHS SNPs Workshop
Jan 10-11, 2008
Complex inheritance/disease
Variant Gene
Disease
Diabetes, Heart Disease, Schizophrenia, Obesity, Multiple Sclerosis, Celiac Disease, Cancer, Asthma, Autism
Many Other Genes
Environment
Two hypotheses:
1 - common disease/common variant?
2 - common disease/many rare variants?
Strategies for Genetic Analysis
Simple Inheritance: Single Gene, Rare Variants, ~600 Short Tandem Repeat Markers, Families, Linkage Studies
Complex Inheritance: Multiple Genes, Common Variants, Polymorphic Markers > 300,000 - 1,000,000 Single Nucleotide Polymorphisms (SNPs), Populations, Association Studies
Cases: 40% T, 60% C
Controls: 15% T, 85% C
[Figure: family and population samples labeled with C/C and C/T genotypes]
Strategies for Genetic Analysis (GWAS)
Samples with phenotype data (continuous, case-control) (n = 100s - 1000s)
Genotype samples with commercial 'chips'
• Affymetrix - Random SNP design (v.5, v.6)
• Illumina - preselected tagSNP design (650Y, 1M)
Perform statistical association with each SNP
Calculate p-value for each SNP
Region with plausible statistical association
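The per-SNP association step can be sketched as a 2x2 chi-square test on allele counts. This is only an illustration, not necessarily the test used in any particular GWAS. The counts are hypothetical: they assume 50 cases and 50 controls (100 alleles each), chosen to match the 40% T / 15% T example frequencies. The function name `allelic_chi_square` is also illustrative.

```python
import math

def allelic_chi_square(case_counts, control_counts):
    """2x2 chi-square test on allele counts given as (T, C) for cases and controls."""
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the chi-square upper tail equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts (assumption): 100 case alleles and 100 control alleles.
chi2, p = allelic_chi_square((40, 60), (15, 85))
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")
```

In a genome-wide scan the same calculation would be repeated for every genotyped SNP, so the p-value threshold has to account for the number of tests.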
Development of a genome-wide SNP map: How many SNPs?
Nickerson and Kruglyak, Nature Genetics, 2001
~10 million common SNPs (> 1-5% MAF) - 1/300 bp
How has SNP discovery progressed toward this goal?
SNP Resources: SNP discovery and cataloging
1. SNP discovery/genotyping: Genome-wide approaches (SNP Consortium, HapMap)
2. The current state of SNP resources (dbSNP)
3. Comprehensive SNP discovery: NIEHS SNPs - Environmental Genome Project
SNP Databases - “How to” Manual for finding SNPs (In class - Tutorial)
HapMap Project: Create a genome-wide SNP map
Genotype SNPs in four populations:
• CEPH (CEU) (Europe - n = 90, trios)
• Yoruban (YRI) (Africa - n = 90, trios)
• Japanese (JPT) (Asian - n = 45)
• Chinese (HCB) (Asian - n = 45)
To produce a genome-wide map of common variation
Common Variant/Common Disease
Phase I - 1M SNPs
Phase II - 4M SNPs
Density ~ 1 SNP/kb
626 genes
15 Mbp
91,000 SNPs
Density - 1 SNP/166 bp
Finding SNPs: Marker Discovery and Methods
SNP discovery has proceeded in two distinct phases:
1 - SNP Identification/Discovery
Define the alleles
Map this to a unique place in the genome
2 - SNP Characterization (HapMap genotyping)
Determination of the genotype in many individuals
Population frequency of SNPs
Finding SNPs: Sequence-based SNP Mining
Sequence Overlap - SNP Discovery
GTTACGCCAATACAGGATCCAGGAGATTACC
GTTACGCCAATACAGCATCCAGGAGATTACC
[Diagram: DNA sequencing sources. mRNA → cDNA Library → EST Overlap; Genomic → BAC Library → BAC Overlap; Genomic → RRS Library → Shotgun Overlap. Caveats: RT errors? Sequencing quality]
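As a rough sketch of overlap-based discovery, two already-aligned, equal-length overlapping sequences can be compared base by base, with each difference reported as a candidate SNP. The helper name `overlap_mismatches` is illustrative, and the sketch ignores base quality and the RT-error caveat.

```python
def overlap_mismatches(seq_a, seq_b):
    """Return (position, base_a, base_b) for each difference between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, x, y) for i, (x, y) in enumerate(zip(seq_a, seq_b)) if x != y]

# The two overlapping example sequences from the slide.
read_1 = "GTTACGCCAATACAGGATCCAGGAGATTACC"
read_2 = "GTTACGCCAATACAGCATCCAGGAGATTACC"
candidates = overlap_mismatches(read_1, read_2)
print(candidates)
```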
Finding SNPs: Sequence-based SNP Mining
BAC = Bacterial Artificial Chromosome
Primary vector for DNA cloning in the HGP
Fragment DNA
DNA from multiple individuals
Clone large fragments into BACs (unknown sequence)
Sequence and Reassemble (known sequence)
Assembly with other overlapping BACs
GTTACGCCAATACAGGATCCAGGAGATTACC
GTTACGCCAATACAGCATCCAGGAGATTACC
HapMap SNP Discovery: Prior to Genotyping
Initiation of project planning (July 2001): 2.8 million SNPs (1.4 million validated) - 1/1900 bp
Generate more SNPs: Random Shotgun Sequencing
Genomic DNA (multiple individuals)
Sequence and align (reference sequence)
Reads: TACGCCTATA, TCAAGGAGAT
Draft Human Genome: GTTACGCCAATACAGGATCCAGGAGATTACC
Nov 2003 - 5.7 million (2 million validated) - 1/1500 bp
Feb 2004 - 7.2 million (3.3 million validated) - 1/900 bp
Other Sources of SNPs:
Perlegen (Affymetrix chips) SNP data (chr22)
Sequence chromatograms from Celera project
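The "sequence and align" idea can be sketched by sliding each shotgun read along the reference without gaps and keeping the placement with the fewest mismatches. This is a toy stand-in for a real aligner, using the short reads and draft-genome sequence shown on the slide. The function name `best_ungapped_hit` is an assumption for illustration.

```python
def best_ungapped_hit(read, reference):
    """Return (offset, mismatches) for the ungapped placement with the fewest mismatches.

    Each mismatch is (reference_position, reference_base, read_base).
    """
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = [
            (offset + i, ref_base, read_base)
            for i, (ref_base, read_base) in enumerate(zip(window, read))
            if ref_base != read_base
        ]
        if best is None or len(mismatches) < len(best[1]):
            best = (offset, mismatches)
    return best

reference = "GTTACGCCAATACAGGATCCAGGAGATTACC"
for read in ("TACGCCTATA", "TCAAGGAGAT"):
    offset, mismatches = best_ungapped_hit(read, reference)
    print(read, "aligns at", offset, "candidate SNPs:", mismatches)
```

Each mismatch against the reference is only a candidate. Sequencing errors look the same, which is why validation and genotyping follow discovery.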
SNP Discovery: dbSNP database
dbSNP - NCBI SNP database
SNP data submitted to dbSNP: Clustering
dbSNP processing of SNPs
SNPs submitted by research community (submitted SNPs = ss#)
Unique mapping to a genome location (reference SNP = rs#)
Validated (by 2hit-2allele)
Unvalidated
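The clustering and validation steps can be sketched loosely as follows. Submissions that map to the same genomic position are grouped into one cluster, and a cluster is called validated when each of two alleles has been reported at least twice. That is a simplified reading of "2hit-2allele", not dbSNP's actual pipeline. The identifiers, positions, and function names are all hypothetical.

```python
from collections import defaultdict

def cluster_submissions(submissions):
    """Group submitted SNPs (ss#) by the genomic position they map to."""
    clusters = defaultdict(list)
    for ss_id, chrom, pos, alleles in submissions:
        clusters[(chrom, pos)].append((ss_id, alleles))
    return clusters

def two_hit_two_allele(members):
    """True if at least two alleles were each reported by two or more submissions."""
    hits = defaultdict(int)
    for _ss_id, alleles in members:
        for allele in set(alleles):
            hits[allele] += 1
    return sum(1 for count in hits.values() if count >= 2) >= 2

# Hypothetical submissions: (ss id, chromosome, position, alleles).
submissions = [
    ("ss1", "chr1", 100, ("C", "G")),
    ("ss2", "chr1", 100, ("C", "G")),
    ("ss3", "chr2", 500, ("A", "T")),
]
clusters = cluster_submissions(submissions)
for location, members in clusters.items():
    status = "validated" if two_hit_two_allele(members) else "unvalidated"
    print(location, [ss for ss, _ in members], status)
```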
[Chart: Total Reference SNPs and Validated SNPs (millions, 0-14) by dbSNP release, Jan-03 to Mar-07]
HapMap SNP Discovery
HapMap Discovery Increased SNP Density and Validated SNPs
11+ million rs SNPs
5.6 million validated rs SNPs
Finding SNPs: Marker Discovery and Methods
SNP discovery has proceeded in two distinct phases:
1 - SNP Identification/Discovery
Define the alleles
Map this to a unique place in the genome
2 - SNP Characterization (HapMap genotyping)
Determination of the genotype in many individuals
Population frequency of SNPs
HapMap Project: Create a genome-wide SNP map
Genotype SNPs in four populations:
• CEPH (CEU) (Europe - n = 90, trios)
• Yoruban (YRI) (Africa - n = 90, trios)
• Japanese (JPT) (Asian - n = 45)
• Chinese (HCB) (Asian - n = 45)
To produce a genome-wide map of common variation
Finding SNPs: Genotype Data Adds Value to SNPs
Confirms SNP as “real” and “informative”
Minor Allele Frequency (MAF) - common or rare
MAF differs between populations
Detection of SNP x SNP correlations (Linkage Disequilibrium)
Determine tagging SNPs (tagSNPs)
Determine haplotypes
Indiv. #1: GTTACGCCAATACAG[G/G]ATCCAGGAGATTACC
Indiv. #2: GTTACGCCAATACAG[C/G]ATCCAGGAGATTACC
Indiv. #3: GTTACGCCAATACAG[C/C]ATCCAGGAGATTACC
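A minimal sketch of how genotype data yields allele frequencies and a MAF, using the three example individuals (G/G, C/G, C/C). The function names are illustrative. Real panels would have many more samples, per population.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Count alleles across diploid genotypes written as 'A/B' and return their frequencies."""
    counts = Counter(allele for genotype in genotypes for allele in genotype.split("/"))
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}

def minor_allele_frequency(genotypes):
    """Frequency of the less common allele; 0.0 if only one allele was observed."""
    freqs = allele_frequencies(genotypes)
    return min(freqs.values()) if len(freqs) > 1 else 0.0

genotypes = ["G/G", "C/G", "C/C"]
freqs = allele_frequencies(genotypes)
maf = minor_allele_frequency(genotypes)
print(freqs, maf)
```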
Interleukin 6 - Collapsed Dataset: Minor Allele Frequency (MAF) > 10%
African-Am.
European-Am.
• Most of the SNPs in the genome are rare
• Significant differences in SNP frequency can exist between populations
• Correlation between SNPs can be used to select the most informative SNPs
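The correlation between two SNPs is commonly summarized as the linkage disequilibrium statistic r². Here is a sketch assuming phased haplotypes and biallelic SNPs. The haplotypes are made up for illustration, and `r_squared` is an illustrative name.

```python
def r_squared(haplotypes):
    """r^2 between two biallelic SNPs from phased two-character haplotypes.

    Each haplotype is a string: allele at SNP 1 followed by allele at SNP 2.
    """
    n = len(haplotypes)
    a, b = haplotypes[0][0], haplotypes[0][1]  # reference alleles for the calculation
    p_a = sum(h[0] == a for h in haplotypes) / n
    p_b = sum(h[1] == b for h in haplotypes) / n
    p_ab = sum(h == a + b for h in haplotypes) / n
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    return d * d / denominator

# Hypothetical haplotypes: perfectly correlated SNPs vs. independent SNPs.
print(r_squared(["GA", "GA", "CT", "CT"]))
print(r_squared(["GA", "GT", "CA", "CT"]))
```

When r² between two SNPs is high, genotyping one of them captures most of the information in the other, which is the basis for choosing tagSNPs.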
[Chart: Validated SNPs and SNPs with Genotypes (millions, 0-6) by dbSNP release, Jan-03 to Mar-07]
HapMap Genotyping
dbSNP: All validated SNPs now have been genotyped!
HapMap release = 4 M validated QC SNPs
5.6 million validated rs SNPs
Development of a genome-wide SNP map: How many SNPs?
Nickerson and Kruglyak, Nature Genetics, 2001
~10 million common SNPs (> 1-5% MAF) - 1/300 bp
When will we have them all?
Feb 2001 - 1.42 million (1/1900 bp)
Nov 2003 - 2.0 million (1/1500 bp)
Feb 2004 - 3.3 million (1/900 bp)
Mar 2007 - 5.6 million (validated - 1/535 bp)
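The densities quoted here are genome length divided by SNP count. The sketch below assumes a genome of 3,000 Mb, which follows from the ENCODE slide's description of 30 Mb as 1% of the genome, and roughly reproduces the rounded figures. The Feb 2001 entry is left out because 1.42 million SNPs at 1/1900 bp implies a genome length of about 2.7 Gb.

```python
GENOME_BP = 3_000_000_000  # assumption: 30 Mb is described as 1% of the genome

def bp_per_snp(region_bp, snp_count):
    """Average spacing between SNPs, in base pairs."""
    return region_bp / snp_count

milestones = {
    "Nov 2003": 2.0e6,
    "Feb 2004": 3.3e6,
    "Mar 2007": 5.6e6,
    "Estimated common SNPs": 10e6,
}
for label, count in milestones.items():
    print(f"{label}: about 1 SNP per {bp_per_snp(GENOME_BP, count):.0f} bp")
```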
Finding SNPs: Sequence-based SNP Mining
RANDOM Sequence Overlap - SNP Discovery
GTTACGCCAATACAGGATCCAGGAGATTACC
GTTACGCCAATACAGCATCCAGGAGATTACC
[Diagram: DNA sequencing sources. Genomic → RRS Library → Shotgun Overlap; Genomic → BAC Library → BAC Overlap; mRNA → cDNA Library → EST Overlap; Random Shotgun → Align to Reference]
SNP discovery is dependent on your sample population size
GTTACGCCAATACAGGATCCAGGAGATTACC
GTTACGCCAATACAGCATCCAGGAGATTACC
(2 chromosomes)
[Plot: Fraction of SNPs Discovered (0.0-1.0) vs. Minor Allele Frequency (MAF, 0.0-0.5), with curves for 2 and 8 chromosomes]
HapMap data doesn’t capture low frequency SNPs
SNP Characterization/Genotyping
Nickerson and Kruglyak, Nature Genetics, 2001
~10 million common SNPs (> 1-5% MAF) - 1/300 bp
Mar 2007 - 5.6 million (validated/mapped - 1/535 bp)
When will we have them all?
HapMap data doesn’t capture low frequency alleles
Used as a proxy for all common variation through linkage disequilibrium
Targeted SNP Discovery
Directed analysis: cSNPs
[Diagram: gene 5’ to 3’ with PCR amplicons; coding changes marked Arg-Cys, Val-Val]
Complete analysis: cSNPs, Linkage Disequilibrium and Haplotype Data
[Diagram: gene 5’ to 3’ with PCR amplicons; coding changes marked Arg-Cys, Val-Val]
• Generate SNP data from complete genomic resequencing (i.e. 5’ regulatory, exon, intron, 3’ regulatory sequence)
Increasing Sample Size Improves SNP Discovery
GTTACGCCAATACAGGATCCAGGAGATTACC
GTTACGCCAATACAGCATCCAGGAGATTACC
(2 chromosomes)
[Plot: Fraction of SNPs Discovered (0.0-1.0) vs. Minor Allele Frequency (MAF, 0.0-0.5), with curves for 2, 8, 16, 24, 48, and 96 chromosomes]
HapMap: Based on ~8 chromosomes
Resequencing (n=48)
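Curves of this shape can be approximated with a simple sampling model, which may differ from the one used for the original plot. It assumes n chromosomes are drawn at random and a SNP counts as discovered only if both alleles appear in the sample, so the probability is 1 - (1 - p)^n - p^n for minor allele frequency p.

```python
def discovery_probability(maf, n_chromosomes):
    """Chance that both alleles of a SNP appear among n randomly sampled chromosomes."""
    return 1.0 - (1.0 - maf) ** n_chromosomes - maf ** n_chromosomes

for n in (2, 8, 16, 24, 48, 96):
    row = ", ".join(
        f"MAF {maf:.2f}: {discovery_probability(maf, n):.2f}"
        for maf in (0.01, 0.05, 0.10, 0.50)
    )
    print(f"n = {n:>2} chromosomes -> {row}")
```

Under this model a small discovery panel finds most common SNPs but misses many rare ones, which matches the slide's point about low frequency SNPs.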
Increasing SNP Density: HapMap ENCODE Project
ENCODE = ENCyclopedia Of DNA Elements
Catalog all functional elements in 1% of the genome (30 Mb)
10 Regions x 500 kb/region (Pilot Project)
David Altshuler (Broad), Richard Gibbs (Baylor)
16 CEU, 16 YRI, 8 HCB, 8 JPT
Comprehensive PCR based resequencing across these regions
15,367 dbSNP
16,248 New SNPs
50% of SNPs in dbSNP
5 Mb/31,500 SNPs = 1/160 bp
Goal: Comprehensively identify all common sequence variation in candidate genes
Initial biological focus: Candidate environmental response genes involved in DNA repair, cell cycle, apoptosis, metabolism, cell signaling, and oxidative stress.
Approach: Direct resequencing of genes
Samples:
PDR - 90 ethnically diverse individuals representative of U.S. population (397 genes)
EGP95 - 95 samples from four ethnic groups (227 genes) (24 HapMap Asians, 22 HapMap Europeans, 12 HapMap Yorubans, 15 African Americans, 22 Hispanic)
Nov 2005 - Zaitlen et al. Genome Research 15:1594-1600
Summary of NIEHS SNPs genotypes in dbSNP
626 genes sequenced
15 Mb scanned
> 90,000 genotyped SNPs identified
> 8 million genotypes deposited in dbSNP
Development of a genome-wide SNP map: How many SNPs?
Nickerson and Kruglyak, Nature Genetics, 2001
~10 million SNPs (>1% MAF-estimated) - 1/300 bp
NIEHS SNPs = 1/166 bp (n = 95, 4 pops)
HapMap ENCODE = 1/160 bp (n = 48, 3 pops)
Comprehensive resequencing can identify the vast majority of SNPs in a region
SNP Discovery: dbSNP database
Rarer and population specific SNPs are found by resequencing
[Figure: SNPs by Minor Allele Freq. (MAF), comparing dbSNP (Perlegen/HapMap) with NIEHS SNPs; bracketed fractions labeled 60%, 15%, and 25%]
Pathways
Mismatch repair
Double strand break repair
Apoptosis
Cell cycle
Transcription coupled repair
nsSNPs are found in ~60% of genes
Protein disrupting SNPs in EGP data
SIFT = Sorting Intolerant From Tolerant
Evolutionary comparison of non-synonymous SNPs
Classifications: Tolerant, Intolerant
Ng and Henikoff, Genome Research, 12:436-446, 2002
PolyPhen - Polymorphism Phenotyping
Structural protein characteristics and evolutionary comparison
Classifications: benign, possibly damaging, probably damaging
Ramensky, et al., Nucleic Acids Research, 30:17:3894-3900, 2002
Computational inference of nsSNP function
Deleterious nsSNPs are preferentially low frequency
EGP and HapMap Genotyping
PDR = 90 ethnically diverse individuals representative of U.S. population
PDR: ethnically diverse representation of US, n = 90 (anonymous)
397 genes/55,000 SNPs
Select SNPs: ~7600 SNPs (Illumina)
Genotyped samples:
European (CEU, n = 60)
African (YRI, n = 60)
Asian (HCB, n = 45)
Asian (JPT, n = 45)
Hispanic (n = 60)
Afr.-American (n = 62)
Outputs: Genotypes, Allele Frequency, HWE
EGP p1 panel
Illumina NIEHS SNPs Genotyping
Each well samples 1536 SNPs in one individual
For each HapMap sample 5 x 1536 (7680 genotyped SNPs)
3,000,000 genotypes generated (total ~400 samples)
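The genotype totals follow from simple multiplication, sketched below. The sample count of 400 is the slide's approximate figure, so the product is only expected to land near 3,000,000.

```python
SNPS_PER_WELL = 1536     # SNPs assayed per well, in one individual
WELLS_PER_SAMPLE = 5     # wells run for each sample
APPROX_SAMPLES = 400     # approximate total, as stated on the slide

snps_per_sample = SNPS_PER_WELL * WELLS_PER_SAMPLE
total_genotypes = snps_per_sample * APPROX_SAMPLES
print(snps_per_sample, total_genotypes)
```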
NIEHS SNPs Genotype Data
PDR (397 genes)
SNPs characterized in six different major populations.
Population Allele Frequency Correlations (Illumina NIEHS SNPs Genotyping)
[Table: pairwise allele frequency correlations among Yoruban, Japanese, CEPH, Chinese, African American, and Hispanic samples; values as extracted: 0.341, 0.931, 0.282, 0.274, 0.841, 0.626, 0.533, 0.415, 0.410, 0.761, 0.952, 0.610, 0.420, 0.589, 0.765]
EGP and HapMap Compatibility
HapMap/US Pop. (n = 332):
European (CEU, n = 60)
African (YRI, n = 60)
Asian (HCB, n = 45)
Asian (JPT, n = 45)
Hispanic (n = 60)
Afr.-American (n = 62)
Panel 2 (HapMap based, n = 95):
European (CEU, n = 22)
African (YRI, n = 12)
Afr.-American (n = 15)
Asian (JPT & HCB, n = 24)
Hispanic (n = 22)
EGP p2 DNA panel
HapMap Integration (~4 million SNPs)
[Legend: EGP SNP discovery (1/200 bp); HapMap SNPs (~1/1000 bp)]
High Density Genic Coverage (EGP)
Low Density Genome Coverage (HapMap)
Summary: The Current State of SNP Resources
Approximately 10 million common SNPs exist in the human genome (1/300 bp).
Random SNP discovery processes generate many SNPs (HapMap)
Random approaches to SNP discovery have reached limits of discovery and validation (1/600 bp; 50% SNP validation)
Most validated SNPs (5+ million) have been genotyped by the HapMap (3 pops)
Resequencing approaches continue to catalog important variants (rare and common not captured by the HapMap)
NIEHS SNPs has generated SNP data on >600 candidate genes and 90 K SNPs along with large-scale HapMap+ characterization data for SNPs generated using the PDR