characterizing snps using genomic information
DESCRIPTION
Characterizing SNPs Using Genomic Information. Ivan P. Gorlov, Olga Y. Gorlova & Christopher I. Amos. Colon cancer results from genetic alterations in multiple genes. Inherited mutations in the APC gene dramatically increase risk of colon cancer - PowerPoint PPT PresentationTRANSCRIPT
Characterizing SNPs Using Genomic Information
Ivan P. Gorlov, Olga Y. Gorlova & Christopher I. Amos
Colon cancer results from genetic alterations in multiple genes
Inherited mutations in the APC gene dramatically increase risk of colon cancer
Cancer development reflects early germline susceptibility as well as subsequent mutations that promote cancer
Lu et al., 2014
The continuum of variation
Chung et al.Carcinogenesis 31:111-120, 2010
Linkage
Sequencing
Association
SNPs are the most common type of polymorphism in the human genome
SNPs occur ~ every 400 bases
Minor allele frequency describes prevalence of rarer variant (0,0.5]
Very Few (<0.01%) SNPs are associated with diseases
SNPs are the bread of Genome Wide Association Studies (GWAS)
Success requires inference about effects from Causal Variants
GWAS Have Been Highly Successful in Finding Novel Assocations
Manhattan plot of all lung cancers from 1000 Genomes Imputation
TP63
hTERT
hMSH5BRCA2
CHRNA5
CHEK2
There are multiple signals on Chr 15q.25 - sequencing of 96 people
Exon 476,667,807 GG->GA rs2229961 exon regionVal134->ILE134
Exon 576,669,980 GG->AA rs16969968 exon regionAsp398->Asn398 (negatively charged to positively charged)
76,669,876 TT->TA exon regionLeu363->Glu363 (hydrophobic to hydrophilic)
76,669,781 CC->CT exon regionThr331->Thr331
76,669,288 AA->GA exon regionLys167->Arg167
76,669,665-6 AC deletion exon regionReading frame shift after codon292
Observed in 4 casesIntron 4 76668145
Additional variation (small microsatellites and in/dels) in the promoters of both CHRNA5 and CHRNA3
sdSNPs are deleterious enough to impair gene function and increase disease risk.
Their effects, however, are not strong enough for selection to eliminate them from population (genetic drift, founder effects, and population bottlenecks are factors that help retain sdSNPs).
We estimated the proportion of nonsynonymous SNPs predicted to be functional by PolyPhen and SIFT in different MAF categories
PolyPhen SIFT
Proportions of functional SNPs in different MAF categories
The Erudite Hypothesis:
Large fraction of the genetic susceptibility influenced by rare (<5%) variants with relatively strong effect size
Targeting rarer variants for analysis may detect causal variants when sample size is large.
The majority of the significant SNPs detected by GWAS are SNPs tagging untyped causal variants – greatly reduces power – but indicates region for further study
Brute force analysis is not going to be sufficient to overcome power loss for multiple comparison – have to upweight analyses for most likely causal variants
MAF and a Required Sample Size
The numbers of cases and controls that are required in an association study to detect disease variants with allelic odds ratios of 1.2 (red), 1.3 (blue), 1.5 (yellow), and 2 (black). Numbers shown are for a statistical power of 80% at a significance level of P <10-6.
SNP Replication RateGWAS replication rate is low: about 5%This suggest that GWAS produce a considerable number of false discoveries
Development of methods to predict which SNP will be replicated and which will not is important – relevant for seq.
We evaluated this approach using the results of published GWASs
Evaluating SNP replicationsRetrieved all data from the Catalog of
Published GWA Studies (http://www.genome.gov/26525384/), Sep 13
Restricted analysis to SNPs mapping to a single gene and associated with a disease rather than variation
SNPs related to 106 diseases were studied, 2659 SNPs from 512 studies
SNP findings sorted by date with the first report denoted as a discovery and subsequent studies may be replications
Reproducibility score modeled as the ratio of successful replications over the total number of subsequent studies.
Characteristics Studied
3' UTR, 3' Downstream, 5' Upstream, 5' UTR, Coding nonsynonymous, Coding synonymous, Intergenic, Intronic, Non-coding, and Non-coding intronic.
Name Description Source of the data Conservation index
The level of evolutionary conservation of the protein based on the most distant homolog of the human gene
NCBI HomoloGene database: http://www.ncbi.nlm.nih.gov/homologene
eQTL SNP reported as an eQTL for HapMap Data eQTL SNPs identified in lymphoblastic cell lines from HapMap project [16]
Gene Size Size of the gene region in nucleotides NCBI RefSeq database: http://www.ncbi.nlm.nih.gov/refseq/ Growth factor The protein encoded by the linked gene is a
growth factor Gene Ontology (GO) database http://geneontology.org/
Kinase The protein encoded by the linked gene is a kinase
Gene Ontology (GO) database http://geneontology.org/
-Log(P) Minus LOG(P) where P is the P-value reported in CPG
Catalog of Published GWAS (CPG): http://www.genome.gov/26525384
MAF Minor allele frequency Catalog of Published GWAS (CPG): http://www.genome.gov/26525384 . MAF reported in control group were used.
Nuclear Localization
The protein encoded by the linked gene is localized in the nucleus
Gene Ontology (GO) database http://geneontology.org/
OMIM Was associated gene in OMIM http://www.ncbi.nlm.nih.gov/omim Plasma Membrane
The protein encoded by the linked gene is localized in plasma membrane
Gene Ontology (GO) database http://geneontology.org/
Receptor The protein encoded by the linked gene is a receptor
Gene Ontology (GO) database http://geneontology.org/
SNP type Type of the SNP. For the details see materials and methods
Catalog of the Published GWAS (CPG): http://www.genome.gov/26525384
Tissue specific The expression of the linked gene is tissue specific
Tissue specific Gene Expression and Regulation (TiGER) database: http://bioinfo.wilmer.jhu.edu/tiger/
Transcription factor
The protein encoded by the linked gene is a transcription factor
Gene Ontology (GO) database http://geneontology.org/
Univariate Test ResultsPredictor Test Statistics P-value
Conservation Index Spearman correlation Spearman R = 0.09 0.0007 eQTL Mann-Whitney U test (M-W U test) Z adjusted = -5.39 1.90E-07 Gene Size Spearman correlation Spearman R = -0.11 0.00001 Growth Factor Mann-Whitney U test (M-W U test) Z adjusted = -2.79 0.008 Kinase Mann-Whitney U test (M-W U test) Z adjusted = -7.84 2.40E-14 -Log(P) Spearman correlation Spearman R = 0.36 2.30E-08 MAF Spearman correlation Spearman R = 0.09 0.0007 Nuclear Localization Mann-Whitney U test (M-W U test) Z adjusted = -7.18 2.20E-12 OMIM Mann-Whitney U test (M-W U test) Z adjusted = 8.16 2.20E-15 Plasma Membrane Mann-Whitney U test (M-W U test) Z adjusted = -6.23 1.80E-09 Receptor Mann-Whitney U test (M-W U test) Z adjusted = -5.17 8.90E-07 SNP Type Kruskal-Wallis (KW) test (KW test) Chi-Square = 391. 8, df = 9 5.50E-79 Tissue Specific Mann-Whitney U test (M-W U test) Z adjusted = -3.97 0.0001 Transcription Factor Mann-Whitney U test (M-W U test) Z adjusted = 3.32 0.008
Further evaluating SNP Type(1) All pair-wise comparisons inside the
group should be insignificant; and (2) All pairwise comparisons between the groups should be significant.
SNP reproducibility was lowest in the group 1 (5’ UTR, Intergenic, 5’ Upstream, Non-coding intronic, Intronic), intermediate in the group 2 (Coding synonymous, 3’ Downstream, Non-coding), and highest in the group 3 (Coding nonsynonymous, 5’UTR).
43 SNPs had >1 annotation
Univariate Predictors of Replication
Multivariable Analysis of PredictorsCharacteristic B se(B) p-valueSNP_TYPE 0.034687 0.006443 7.31E-08Pvalue_mlog 0.000936 0.000187 5.34E-075_MAF_Groups 0.005509 0.001293 2.03E-05
OMIM 0.021424 0.007765 0.0058Nuclear localization 0.025213 0.008232 0.002192
Kinases -0.01213 0.014694 0.409105Conservation index 0.052759 0.04749 0.266581Growth factor -0.01789 0.020355 0.379Tissue_specific -0.00375 0.007339 0.609778eQTL HapMap -0.00122 0.007148 0.864
Receptors 0.019558 0.012282 0.111304
Transcription factors -0.02182 0.012491 0.080629Gene Size 0.011653 0.007745 0.132435
Interactions among factors
Identification of Causal VariantsShowing a variants has a causal effect on a
disease process is challengingGood candidate for the causal variant should be
linked to relevant biological mechanism. For example it may have effect on gene expression or protein structure/folding.
Functional analysis is needed to prove that (1) SNP change the important biological function, and that (2) alteration of function is associated with disease risk.
ConclusionsMultivariable analysis showed that 11% of
variability was explained by SNP predictors with 5% attributed to –log(P) value alone.
SNP characteristics are second most prominent, followed by MAF (in wrong direction for weighting)
Could also define profile measurement
Number of GWA Studies Evaluated
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 180
5
10
15
20
25
30
35
Num
ber o
f Diff
eren
t Dis
ease
s Stu
died
Number of Replicates
Selecting SNPs for ValidationSelection SNPs for validation stage is trickyThe common approach is based on the
ranking SNPs based on P-values from the discovery stage and select the top SNPs
People also take into account the region: SNPs located in the region detected by linkage analysis of associated with genes from candidate pathway are considered to be a good candidate for replication
Are GWAS-Detected SNPs Tagging or Causal Variants
Even large genotyping studies e.g. genotyping 1,000,000 SNPs cover only a fraction of genetic variation in the human genome: there are more than 70,000,000 SNPs in the human genome
Causal variant may or may not be on a genotyping platform but is likely to be captured by sequencing studiesShort repeats are not detectableSome regions of the genome are not yet mapped Inversions are difficult to identify
Improving power to detect uncommon variants Single variant as disease locus*Hardy-Weinberg Equilibrium in controlsAffection status modeled by a penetrance
table. case-control samples
High risk allele A with frequency pLow risk allele B with frequency q
Power calculationactual sample size needed for genotype-
phenotype relationship to riskPeng et al. Hum Genet. 2010 Jun;127(6):699-704.*Multiple variants in a region can be modeled using burden tests
Power = 0.8 Relative risk A: 4, Disease prevalence: 0.05,Disease allele frequency: p=0.01, MAF 0.01, significance 1 x 10 -7
Discovery, Expansion,and Replication
• Find new associations through pooled or meta-analysis
• Independent replication studies to confirm genotype-phenotype associations
• Fine-map of association signals
Biological Studies
• Identify risk-enhancing genetic variants• Examine functional consequences of a
genetic variant• Determine biological mechanisms of risk
enhancement
Epidemiologic Studies
• Evaluate gene-gene and gene-environment interactions
• Assess penetrance and population attributable risk
• Develop complex risk models• Evaluate clinical/analytic validity of risk
models in observational studies
GAME-ON Integrative Projects
Steering Committee(policies and processes)
2 voting members from each U19Working Groups ChairsNCI Program Officers
Next Generation Genomics
TechnologiesAnalytic and Risk
Modeling Functional Assays Epidemiology and Clinical
ExternalAdvisory
Committee
Lung Prostate Breast Colon Ovarian
Working Groups in areas of interest
U19- based on Cancer Sites
GAME-ON Organizational Structure
Executive Committee
(decisional body)5 P.I.s
NCI program Officers
Epigenetics TERT-CLPTM1L
Clinical and Epidemiological Working Group
• Established July 2010; Chairs – Roz Eeles, Rayjean Hung• Major initiatives
‒ Pan-cancer meta-analysis across sites, ‒ Inflammation pathway, ‒ pleiotropy analysis (with PAGE)
• Inventory of data and biospecimens‒ 96,205 cases/241,880 controls‒ whole blood derived DNA for 70585 cases and 81,673 controls‒ 3317 fresh frozen tumors‒ 14,150 FFPE tumors
• Virtual pathology network• Managing data harmonization across sites
Analytical and Risk Modeling Working Group
• Established July 2010, Chair – Peter Kraft• Held regular webinars about innovative
methods• Provided leadership for integration of data
from HapMap 2• Provided guidelines for ongoing imputations of
1000 Genomes imputation• Developed and implemented novel meta-
analytical methods• Evaluated approaches for detecting gene x
environment effects
Sporadic Lung Cancer Genome Wide Chr 5p Association
Wang, et al. Nature Genetics 40(12): 1407–1409 (2008)
There is Marked Heterogeneity in Associations for Chromosome 5p region
Evaluation of 5p Genes as Potential Lung Tumorigenesis Modifiers in a shRNA/ KRASLSL-G12D/+ Mouse Model
Loss of CLPTM1L sensitizes lung tumor cells to genotoxic agents
Prevention and CessationThe Public Health Mission
Initiation
Cigarette Use
Nicotine Dependence
Cessation and Remission
The Tobacco and Genetics Consortium (2010) Nature Genetics
Chromosome 15q25 Is Important for Smoking
CHRNA5-A3-B4
Genetics of Smoking CessationThough the strong genetic effect of the a5
nicotinic receptor contributes to heavy smoking, little to no effect is seen for smoking cessation.
University of Wisconsin Timothy Baker and
Colleagues
Survival Analysis Smoking Cessation
N=1,015
Chen et al., 2012
Genetic Effect Seen in the Placebo Group
Treatment group (N=898)
Placebo group (N=117)
Entire Sample (N=1015)
UW-TTURC study.Haplotypes based on 2 SNPs (rs16969968, rs680244)H1=GC 20.8%H2=GT 43.7%H3=AC 35.5%
Chen et al., 2012
H1 H2 H30.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
1.6
1.00
0.62
0.37
0.981.11 1.13
PlaceboTreatment
OR
(A
bstin
ence
)
Haplotypes
A Significant Genotype by Treatment Interaction
Ref
eren
ce
Haplotypes (rs16969968, rs680244)
H1=GC(20.8%)H2=GT(43.7%)H3=AC(35.5%)
Chen et al., 2012
H1 H2 H30.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
1.6
1.00
0.62
0.37
PlaceboTreatment
Ref
eren
ce
OR
(A
bstin
ence
)
Haplotypes
Haplotypes predict abstinence in individuals receiving placebo medication
Chen et al., 2012
H1 H2 H30.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
1.6
1.00
0.62
0.37
0.981.11 1.13
PlaceboTreatment
OR
(A
bstin
ence
)
Haplotypes
A Significant Genotype by Treatment Interaction
Ref
eren
ce
(X2=8.97, df=2, p=0.01)
Chen et al., 2012
H1 H2 H30.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
1.6
1.000.981.11 1.13
PlaceboTreatment
OR
(A
bstin
ence
)
Haplotypes
Haplotypes do NOT predict abstinence in individuals receiving pharmacologic treatment
Ref
eren
ce
Chen et al., 2012
H1 H2 H30.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
1.6
1.00
0.37
1.13
PlaceboTreatment
OR
(A
bstin
ence
)
Haplotypes
Smokers with the high risk haplotype are more likely to respond to pharmacologic treatment
Ref
eren
ce
Chen et al., 2012
H1 H2 H30.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
1.6
1.000.98
PlaceboTreatment
OR
(A
bstin
ence
)
Haplotypes
Smokers with the low risk haplotype do NOT benefit from pharmacologic treatment
Ref
eren
ce
Chen et al., 2012
The genotype by treatment effect do not differ across pharmacologic treatments.
Place
bo
Bupro
prion
only
NRT only
Combin
ed0.00
0.10
0.20
0.30
0.40
0.50
0.60
0.70
0.80
0.90
1.00
H1H2H3
Abstinence (Proportion)
No difference in haplotypic risks on cessation across medication groups (wald=1.16, df=3, p=0.88)
Chen et al., 2012
ConclusionsGWAS Studies have been amazingly
productive in identifying new genetic architecture of common influences for complex diseases
Evaluating the actual functional effects of identified variants can be challenging, relatively few causal variants are directly identified
Follow-up studies have identified significant effects of genetic loci on cancer development
Genetic loci identified by GWAS analyses have provided insights into cancer prevention
Sequencing is useful but expensive, targeted applications are best
U19 Other Sources CIDR GenotypeLung 7,000 0 40,000 Toronto, ChinaOvary 5,000 0 40,000 CambridgeColorectal
5,000 0 40,000 USC
Breast 6,000 77,000
Genome CanadaEUCR-UKDKFZBCFRMDEIE
40,000 Cambridge, Genome Quebec
Prostate 45,000
46,000
NCIBHMovember
40,000 Cambridge, USC, NHGRI
BRCA1/2 Carriers
0 20,000
10,000 Genome Quebec,Cambridge
TOTAL 68,000
143,000
210,000
Oncochip - $49/sample