characterizing snps using genomic information

Characterizing SNPs Using Genomic Information

Ivan P. Gorlov, Olga Y. Gorlova & Christopher I. Amos

Colon cancer results from genetic alterations in multiple genes

Inherited mutations in the APC gene dramatically increase risk of colon cancer

Cancer development reflects early germline susceptibility as well as subsequent mutations that promote cancer

Lu et al., 2014

The continuum of variation

Chung et al.Carcinogenesis 31:111-120, 2010

Linkage

Sequencing

Association

SNPs are the most common type of polymorphism in the human genome

SNPs occur ~ every 400 bases

Minor allele frequency describes prevalence of rarer variant (0,0.5]

Very Few (<0.01%) SNPs are associated with diseases

SNPs are the bread of Genome Wide Association Studies (GWAS)

Success requires inference about effects from Causal Variants

GWAS Have Been Highly Successful in Finding Novel Assocations

Manhattan plot of all lung cancers from 1000 Genomes Imputation

TP63

hTERT

hMSH5BRCA2

CHRNA5

CHEK2

There are multiple signals on Chr 15q.25 - sequencing of 96 people

Exon 476,667,807 GG->GA rs2229961 exon regionVal134->ILE134

Exon 576,669,980 GG->AA rs16969968 exon regionAsp398->Asn398 (negatively charged to positively charged)

76,669,876 TT->TA exon regionLeu363->Glu363 (hydrophobic to hydrophilic)

76,669,781 CC->CT exon regionThr331->Thr331

76,669,288 AA->GA exon regionLys167->Arg167

76,669,665-6 AC deletion exon regionReading frame shift after codon292

Observed in 4 casesIntron 4 76668145

Additional variation (small microsatellites and in/dels) in the promoters of both CHRNA5 and CHRNA3

sdSNPs are deleterious enough to impair gene function and increase disease risk.

Their effects, however, are not strong enough for selection to eliminate them from population (genetic drift, founder effects, and population bottlenecks are factors that help retain sdSNPs).

We estimated the proportion of nonsynonymous SNPs predicted to be functional by PolyPhen and SIFT in different MAF categories

PolyPhen SIFT

Proportions of functional SNPs in different MAF categories

The Erudite Hypothesis:

Large fraction of the genetic susceptibility influenced by rare (<5%) variants with relatively strong effect size

Targeting rarer variants for analysis may detect causal variants when sample size is large.

The majority of the significant SNPs detected by GWAS are SNPs tagging untyped causal variants – greatly reduces power – but indicates region for further study

Brute force analysis is not going to be sufficient to overcome power loss for multiple comparison – have to upweight analyses for most likely causal variants

MAF and a Required Sample Size

The numbers of cases and controls that are required in an association study to detect disease variants with allelic odds ratios of 1.2 (red), 1.3 (blue), 1.5 (yellow), and 2 (black). Numbers shown are for a statistical power of 80% at a significance level of P <10-6.

SNP Replication RateGWAS replication rate is low: about 5%This suggest that GWAS produce a considerable number of false discoveries

Development of methods to predict which SNP will be replicated and which will not is important – relevant for seq.

We evaluated this approach using the results of published GWASs

Evaluating SNP replicationsRetrieved all data from the Catalog of

Published GWA Studies (http://www.genome.gov/26525384/), Sep 13

Restricted analysis to SNPs mapping to a single gene and associated with a disease rather than variation

SNPs related to 106 diseases were studied, 2659 SNPs from 512 studies

SNP findings sorted by date with the first report denoted as a discovery and subsequent studies may be replications

Reproducibility score modeled as the ratio of successful replications over the total number of subsequent studies.

Characteristics Studied

3' UTR, 3' Downstream, 5' Upstream, 5' UTR, Coding nonsynonymous, Coding synonymous, Intergenic, Intronic, Non-coding, and Non-coding intronic.

Name Description Source of the data Conservation index

The level of evolutionary conservation of the protein based on the most distant homolog of the human gene

NCBI HomoloGene database: http://www.ncbi.nlm.nih.gov/homologene

eQTL SNP reported as an eQTL for HapMap Data eQTL SNPs identified in lymphoblastic cell lines from HapMap project [16]

Gene Size Size of the gene region in nucleotides NCBI RefSeq database: http://www.ncbi.nlm.nih.gov/refseq/ Growth factor The protein encoded by the linked gene is a

growth factor Gene Ontology (GO) database http://geneontology.org/

Kinase The protein encoded by the linked gene is a kinase

Gene Ontology (GO) database http://geneontology.org/

-Log(P) Minus LOG(P) where P is the P-value reported in CPG

Catalog of Published GWAS (CPG): http://www.genome.gov/26525384

MAF Minor allele frequency Catalog of Published GWAS (CPG): http://www.genome.gov/26525384 . MAF reported in control group were used.

Nuclear Localization

The protein encoded by the linked gene is localized in the nucleus


OMIM Was associated gene in OMIM http://www.ncbi.nlm.nih.gov/omim Plasma Membrane

The protein encoded by the linked gene is localized in plasma membrane


Receptor The protein encoded by the linked gene is a receptor


SNP type Type of the SNP. For the details see materials and methods

Catalog of the Published GWAS (CPG): http://www.genome.gov/26525384

Tissue specific The expression of the linked gene is tissue specific

Tissue specific Gene Expression and Regulation (TiGER) database: http://bioinfo.wilmer.jhu.edu/tiger/

Transcription factor

The protein encoded by the linked gene is a transcription factor


Univariate Test ResultsPredictor Test Statistics P-value

Conservation Index Spearman correlation Spearman R = 0.09 0.0007 eQTL Mann-Whitney U test (M-W U test) Z adjusted = -5.39 1.90E-07 Gene Size Spearman correlation Spearman R = -0.11 0.00001 Growth Factor Mann-Whitney U test (M-W U test) Z adjusted = -2.79 0.008 Kinase Mann-Whitney U test (M-W U test) Z adjusted = -7.84 2.40E-14 -Log(P) Spearman correlation Spearman R = 0.36 2.30E-08 MAF Spearman correlation Spearman R = 0.09 0.0007 Nuclear Localization Mann-Whitney U test (M-W U test) Z adjusted = -7.18 2.20E-12 OMIM Mann-Whitney U test (M-W U test) Z adjusted = 8.16 2.20E-15 Plasma Membrane Mann-Whitney U test (M-W U test) Z adjusted = -6.23 1.80E-09 Receptor Mann-Whitney U test (M-W U test) Z adjusted = -5.17 8.90E-07 SNP Type Kruskal-Wallis (KW) test (KW test) Chi-Square = 391. 8, df = 9 5.50E-79 Tissue Specific Mann-Whitney U test (M-W U test) Z adjusted = -3.97 0.0001 Transcription Factor Mann-Whitney U test (M-W U test) Z adjusted = 3.32 0.008

Further evaluating SNP Type(1) All pair-wise comparisons inside the

group should be insignificant; and (2) All pairwise comparisons between the groups should be significant.

SNP reproducibility was lowest in the group 1 (5’ UTR, Intergenic, 5’ Upstream, Non-coding intronic, Intronic), intermediate in the group 2 (Coding synonymous, 3’ Downstream, Non-coding), and highest in the group 3 (Coding nonsynonymous, 5’UTR).

43 SNPs had >1 annotation

Univariate Predictors of Replication

Multivariable Analysis of PredictorsCharacteristic B se(B) p-valueSNP_TYPE 0.034687 0.006443 7.31E-08Pvalue_mlog 0.000936 0.000187 5.34E-075_MAF_Groups 0.005509 0.001293 2.03E-05

OMIM 0.021424 0.007765 0.0058Nuclear localization 0.025213 0.008232 0.002192

Kinases -0.01213 0.014694 0.409105Conservation index 0.052759 0.04749 0.266581Growth factor -0.01789 0.020355 0.379Tissue_specific -0.00375 0.007339 0.609778eQTL HapMap -0.00122 0.007148 0.864

Receptors 0.019558 0.012282 0.111304

Transcription factors -0.02182 0.012491 0.080629Gene Size 0.011653 0.007745 0.132435

Interactions among factors

Identification of Causal VariantsShowing a variants has a causal effect on a

disease process is challengingGood candidate for the causal variant should be

linked to relevant biological mechanism. For example it may have effect on gene expression or protein structure/folding.

Functional analysis is needed to prove that (1) SNP change the important biological function, and that (2) alteration of function is associated with disease risk.

ConclusionsMultivariable analysis showed that 11% of

variability was explained by SNP predictors with 5% attributed to –log(P) value alone.

SNP characteristics are second most prominent, followed by MAF (in wrong direction for weighting)

Could also define profile measurement

Number of GWA Studies Evaluated

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 180

5

10

15

20

25

30

35

Num

ber o

f Diff

eren

t Dis

ease

s Stu

died

Number of Replicates

Selecting SNPs for ValidationSelection SNPs for validation stage is trickyThe common approach is based on the

ranking SNPs based on P-values from the discovery stage and select the top SNPs

People also take into account the region: SNPs located in the region detected by linkage analysis of associated with genes from candidate pathway are considered to be a good candidate for replication

Are GWAS-Detected SNPs Tagging or Causal Variants

Even large genotyping studies e.g. genotyping 1,000,000 SNPs cover only a fraction of genetic variation in the human genome: there are more than 70,000,000 SNPs in the human genome

Causal variant may or may not be on a genotyping platform but is likely to be captured by sequencing studiesShort repeats are not detectableSome regions of the genome are not yet mapped Inversions are difficult to identify

Improving power to detect uncommon variants Single variant as disease locus*Hardy-Weinberg Equilibrium in controlsAffection status modeled by a penetrance

table. case-control samples

High risk allele A with frequency pLow risk allele B with frequency q

Power calculationactual sample size needed for genotype-

phenotype relationship to riskPeng et al. Hum Genet. 2010 Jun;127(6):699-704.*Multiple variants in a region can be modeled using burden tests

Power = 0.8 Relative risk A: 4, Disease prevalence: 0.05,Disease allele frequency: p=0.01, MAF 0.01, significance 1 x 10 -7

Discovery, Expansion,and Replication

• Find new associations through pooled or meta-analysis

• Independent replication studies to confirm genotype-phenotype associations

• Fine-map of association signals

Biological Studies

• Identify risk-enhancing genetic variants• Examine functional consequences of a

genetic variant• Determine biological mechanisms of risk

enhancement

Epidemiologic Studies

• Evaluate gene-gene and gene-environment interactions

• Assess penetrance and population attributable risk

• Develop complex risk models• Evaluate clinical/analytic validity of risk

models in observational studies

GAME-ON Integrative Projects

Steering Committee(policies and processes)

2 voting members from each U19Working Groups ChairsNCI Program Officers

Next Generation Genomics

TechnologiesAnalytic and Risk

Modeling Functional Assays Epidemiology and Clinical

ExternalAdvisory

Committee

Lung Prostate Breast Colon Ovarian

Working Groups in areas of interest

U19- based on Cancer Sites

GAME-ON Organizational Structure

Executive Committee

(decisional body)5 P.I.s

NCI program Officers

Epigenetics TERT-CLPTM1L

Clinical and Epidemiological Working Group

• Established July 2010; Chairs – Roz Eeles, Rayjean Hung• Major initiatives

‒ Pan-cancer meta-analysis across sites, ‒ Inflammation pathway, ‒ pleiotropy analysis (with PAGE)

• Inventory of data and biospecimens‒ 96,205 cases/241,880 controls‒ whole blood derived DNA for 70585 cases and 81,673 controls‒ 3317 fresh frozen tumors‒ 14,150 FFPE tumors

• Virtual pathology network• Managing data harmonization across sites

Analytical and Risk Modeling Working Group

• Established July 2010, Chair – Peter Kraft• Held regular webinars about innovative

methods• Provided leadership for integration of data

from HapMap 2• Provided guidelines for ongoing imputations of

1000 Genomes imputation• Developed and implemented novel meta-

analytical methods• Evaluated approaches for detecting gene x

environment effects

Sporadic Lung Cancer Genome Wide Chr 5p Association

Wang, et al. Nature Genetics 40(12): 1407–1409 (2008)

There is Marked Heterogeneity in Associations for Chromosome 5p region

Evaluation of 5p Genes as Potential Lung Tumorigenesis Modifiers in a shRNA/ KRASLSL-G12D/+ Mouse Model

Loss of CLPTM1L sensitizes lung tumor cells to genotoxic agents

Prevention and CessationThe Public Health Mission

Initiation

Cigarette Use

Nicotine Dependence

Cessation and Remission

The Tobacco and Genetics Consortium (2010) Nature Genetics

Chromosome 15q25 Is Important for Smoking

CHRNA5-A3-B4

Genetics of Smoking CessationThough the strong genetic effect of the a5

nicotinic receptor contributes to heavy smoking, little to no effect is seen for smoking cessation.

University of Wisconsin Timothy Baker and

Colleagues

Survival Analysis Smoking Cessation

N=1,015

Chen et al., 2012

Genetic Effect Seen in the Placebo Group

Treatment group (N=898)

Placebo group (N=117)

Entire Sample (N=1015)

UW-TTURC study.Haplotypes based on 2 SNPs (rs16969968, rs680244)H1=GC 20.8%H2=GT 43.7%H3=AC 35.5%

Chen et al., 2012

H1 H2 H30.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

1.00

0.62

0.37

0.981.11 1.13

PlaceboTreatment

OR

(A

bstin

ence

)

Haplotypes

A Significant Genotype by Treatment Interaction

Ref

eren

ce

Haplotypes (rs16969968, rs680244)

H1=GC(20.8%)H2=GT(43.7%)H3=AC(35.5%)

Chen et al., 2012

H1 H2 H30.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

1.00

0.62

0.37

PlaceboTreatment

Ref

eren

ce

OR

(A

bstin

ence

)

Haplotypes

Haplotypes predict abstinence in individuals receiving placebo medication

Chen et al., 2012

H1 H2 H30.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

1.00

0.62

0.37

0.981.11 1.13

PlaceboTreatment

OR

(A

bstin

ence

)

Haplotypes

A Significant Genotype by Treatment Interaction

Ref

eren

ce

(X2=8.97, df=2, p=0.01)

Chen et al., 2012

H1 H2 H30.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

1.000.981.11 1.13

PlaceboTreatment

OR

(A

bstin

ence

)

Haplotypes

Haplotypes do NOT predict abstinence in individuals receiving pharmacologic treatment

Ref

eren

ce

Chen et al., 2012

H1 H2 H30.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

1.00

0.37

1.13

PlaceboTreatment

OR

(A

bstin

ence

)

Haplotypes

Smokers with the high risk haplotype are more likely to respond to pharmacologic treatment

Ref

eren

ce

Chen et al., 2012

H1 H2 H30.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

1.000.98

PlaceboTreatment

OR

(A

bstin

ence

)

Haplotypes

Smokers with the low risk haplotype do NOT benefit from pharmacologic treatment

Ref

eren

ce

Chen et al., 2012

The genotype by treatment effect do not differ across pharmacologic treatments.

Place

bo

Bupro

prion

only

NRT only

Combin

ed0.00

0.10

0.20

0.30

0.40

0.50

0.60

0.70

0.80

0.90

1.00

H1H2H3

Abstinence (Proportion)

No difference in haplotypic risks on cessation across medication groups (wald=1.16, df=3, p=0.88)

Chen et al., 2012

ConclusionsGWAS Studies have been amazingly

productive in identifying new genetic architecture of common influences for complex diseases

Evaluating the actual functional effects of identified variants can be challenging, relatively few causal variants are directly identified

Follow-up studies have identified significant effects of genetic loci on cancer development

Genetic loci identified by GWAS analyses have provided insights into cancer prevention

Sequencing is useful but expensive, targeted applications are best

U19 Other Sources CIDR GenotypeLung 7,000 0 40,000 Toronto, ChinaOvary 5,000 0 40,000 CambridgeColorectal

5,000 0 40,000 USC

Breast 6,000 77,000

Genome CanadaEUCR-UKDKFZBCFRMDEIE

40,000 Cambridge, Genome Quebec

Prostate 45,000

46,000

NCIBHMovember

40,000 Cambridge, USC, NHGRI

BRCA1/2 Carriers

0 20,000

10,000 Genome Quebec,Cambridge

TOTAL 68,000

143,000

210,000

Oncochip - $49/sample

characterizing snps using genomic information

Documents