genome-wide association studies
![Page 1: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/1.jpg)
Genome-wide association studies
Usman Roshan
![Page 2: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/2.jpg)
SNP
• Single nucleotide polymorphism
• Specific position and specific chromosome
![Page 3: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/3.jpg)
SNP genotype
Suppose this is the DNA on chromosome 1 starting from position 1.
There is a SNP C/G on position 5, C/T on position 14, and G/T on position 21. This person is heterozygous in the first SNP and homozygous in the other two.
F: AACACAATTAGTACAATTATGAC
M: AACAGAATTAGTACAATTATGAC
![Page 4: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/4.jpg)
SNP genotype representation
The example
F: AACACAATTAGTACAATTATGAC
M: AACAGAATTAGTACAATTATGAC
is represented as
CG CC GG …
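The representation above can be reproduced with a short sketch. The helper name `genotype_at` and the list `snp_positions` are illustrative names that do not appear in the slides; positions are 1-based, as in the example.

```python
# Paternal (F) and maternal (M) copies of the example sequence.
father = "AACACAATTAGTACAATTATGAC"
mother = "AACAGAATTAGTACAATTATGAC"

# 1-based SNP positions taken from the example: C/G, C/T, G/T.
snp_positions = [5, 14, 21]


def genotype_at(position, strand_f, strand_m):
    """Return the two alleles at a 1-based position, alphabetically ordered."""
    alleles = sorted([strand_f[position - 1], strand_m[position - 1]])
    return "".join(alleles)


genotypes = [genotype_at(p, father, mother) for p in snp_positions]
print(" ".join(genotypes))  # CG CC GG
```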
![Page 5: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/5.jpg)
SNP genotype
• For several individuals

| Individual | A/T | C/T | G/T | … |
|---|---|---|---|---|
| H0 | AA | TT | GG | … |
| H1 | AT | CC | GT | … |
| H2 | AA | CT | GT | … |
| … | | | | |
![Page 6: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/6.jpg)
SNP genotype encoding
• If SNP is A/B (alphabetically ordered) then count number of times we see B.
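• Previous example becomes

| Individual | A/T | C/T | G/T | => | A/T (encoded) | C/T (encoded) | G/T (encoded) |
|---|---|---|---|---|---|---|---|
| H0 | AA | TT | GG | | 0 | 2 | 0 |
| H1 | AT | CC | GT | | 1 | 0 | 1 |
| H2 | AA | CT | GT | | 0 | 1 | 1 |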
Now we have data in numerical format
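A minimal sketch of this encoding, assuming each SNP is written as an `"A/B"` string and each genotype as a two-letter string. The function name `encode_genotype` is hypothetical.

```python
def encode_genotype(genotype, snp):
    """Count copies of the alphabetically second allele of an 'A/B' SNP."""
    second = sorted(snp.split("/"))[1]
    return genotype.count(second)


snps = ["A/T", "C/T", "G/T"]
individuals = {
    "H0": ["AA", "TT", "GG"],
    "H1": ["AT", "CC", "GT"],
    "H2": ["AA", "CT", "GT"],
}

encoded = {
    name: [encode_genotype(g, s) for g, s in zip(genos, snps)]
    for name, genos in individuals.items()
}
for name, row in encoded.items():
    print(name, row)
```

Each row of `encoded` matches the numeric row for that individual in the example.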
![Page 7: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/7.jpg)
Genome wide association studies (GWAS)
• Aim to identify which regions (or SNPs) in the genome are associated with disease or certain phenotype.
• Design:
  – Identify population structure
  – Select case subjects (those with disease)
  – Select control subjects (healthy)
  – Genotype a million SNPs for each subject
  – Determine which SNP is associated.
![Page 8: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/8.jpg)
Example GWAS
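| Subject | A/T | C/G | A/G | … |
|---|---|---|---|---|
| Case 1 | AA | CC | AA | |
| Case 2 | AT | CG | AA | |
| Case 3 | AA | CG | AA | |
| Control 1 | TT | GG | GG | |
| Control 2 | TT | CC | GG | |
| Control 3 | TA | CG | GG | |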
![Page 9: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/9.jpg)
Encoded data
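| Subject | A/T | C/G | A/G | => | A/T (encoded) | C/G (encoded) | A/G (encoded) |
|---|---|---|---|---|---|---|---|
| Case1 | AA | CC | AA | | 0 | 0 | 0 |
| Case2 | AT | CG | AA | | 1 | 1 | 0 |
| Case3 | AA | CG | AA | | 0 | 1 | 0 |
| Con1 | TT | GG | GG | | 2 | 2 | 2 |
| Con2 | TT | CC | GG | | 2 | 0 | 2 |
| Con3 | TA | CG | GG | | 1 | 1 | 2 |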
![Page 10: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/10.jpg)
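Ranking SNPs

| Subject | SNP1 (A/T) | SNP2 (C/G) | SNP3 (A/G) | => | SNP1 (encoded) | SNP2 (encoded) | SNP3 (encoded) |
|---|---|---|---|---|---|---|---|
| Case1 | AA | CC | AA | | 0 | 0 | 0 |
| Case2 | AT | CG | AA | | 1 | 1 | 0 |
| Case3 | AA | CG | AA | | 0 | 1 | 0 |
| Con1 | TT | GG | GG | | 2 | 2 | 2 |
| Con2 | TT | CC | GG | | 2 | 0 | 2 |
| Con3 | TA | CG | GG | | 1 | 1 | 2 |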
A good ranking strategy would produce SNP3, SNP1, SNP2
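To see why SNP3, SNP1, SNP2 is a sensible order, here is a toy sketch. The score used (absolute difference between the mean encoded value in cases and in controls) is an assumption chosen only for illustration; it is not the method these slides go on to use, which is the chi-square test.

```python
# Encoded genotypes (rows: subjects, columns: SNP1, SNP2, SNP3).
cases = [[0, 0, 0], [1, 1, 0], [0, 1, 0]]
controls = [[2, 2, 2], [2, 0, 2], [1, 1, 2]]


def column_mean(rows, j):
    """Mean encoded value of SNP column j over the given subjects."""
    return sum(row[j] for row in rows) / len(rows)


# Illustrative score (an assumption, not the slides' method): absolute
# difference between the mean allele count in cases and in controls.
scores = {
    f"SNP{j + 1}": abs(column_mean(cases, j) - column_mean(controls, j))
    for j in range(3)
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['SNP3', 'SNP1', 'SNP2']
```

SNP3 separates cases from controls perfectly, SNP1 partially, and SNP2 hardly at all, which is what the scores reflect.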
![Page 11: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/11.jpg)
Chi-square test
• Gold standard is the univariate non-parametric chi-square test with two degrees of freedom.
• Search for SNPs that deviate from the independence assumption.
• Rank SNPs by p-values
![Page 12: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/12.jpg)
Case-control example

• Study of 100 people:
  – Case: 50 subjects with cancer
  – Control: 50 subjects without cancer
• Count number of alleles and form a 2x2 contingency table

| | #Allele1 | #Allele2 |
|---|---|---|
| Case | 10 | 90 |
| Control | 2 | 98 |

• Relative risk: RR = Pr(disease | one copy of risk allele) / Pr(disease | zero copies of risk allele) (Jewell ‘03)
• Because cases and controls are sampled separately, in numbers fixed by the study design, we cannot estimate the relative risk from a case-control study
• But we can estimate the odds ratio
![Page 13: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/13.jpg)
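Odds ratio

• Odds of an event A is defined as Odds(A) = Pr(A)/Pr(~A)
• Odds ratio is the ratio of two odds. For example the ratio of odds of A and B is

$$OR = \frac{\mathrm{Odds}(A)}{\mathrm{Odds}(B)} = \frac{\Pr(A)/\Pr({\sim}A)}{\Pr(B)/\Pr({\sim}B)}$$

• Odds ratio of disease between the exposed (G=1) and unexposed (G=0) groups would be

$$OR = \frac{\mathrm{Odds}(D \mid G=1)}{\mathrm{Odds}(D \mid G=0)} = \frac{\Pr(D \mid G=1)/\Pr({\sim}D \mid G=1)}{\Pr(D \mid G=0)/\Pr({\sim}D \mid G=0)} = \frac{\Pr(D \mid G=1)}{\Pr(D \mid G=0)} \times \frac{\Pr({\sim}D \mid G=0)}{\Pr({\sim}D \mid G=1)} = RR \times \frac{\Pr({\sim}D \mid G=0)}{\Pr({\sim}D \mid G=1)}$$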
![Page 14: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/14.jpg)
Symmetry in odds ratio
• The odds ratio is symmetric in disease and genotype: OR = Odds(D|G=1)/Odds(D|G=0) = Odds(G|D=1)/Odds(G|D=0)
• Great! Because we can estimate Pr(G|D) from a case-control study. We can now use the OR as an estimate of one’s risk of disease.
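The symmetry can be checked numerically on any joint table of counts. The sketch below uses made-up counts (a hypothetical population, not data from the slides) and exact fractions, so the two odds ratios can be compared without rounding.

```python
from fractions import Fraction

# Hypothetical joint counts n[(d, g)] for disease status d and genotype g
# (1 = diseased / carries risk allele, 0 = otherwise). Values are made up.
n = {(1, 1): 30, (0, 1): 70, (1, 0): 10, (0, 0): 190}


def odds_d_given_g(g):
    """Odds(D | G=g) = Pr(D=1 | G=g) / Pr(D=0 | G=g)."""
    return Fraction(n[(1, g)], n[(0, g)])


def odds_g_given_d(d):
    """Odds(G | D=d) = Pr(G=1 | D=d) / Pr(G=0 | D=d)."""
    return Fraction(n[(d, 1)], n[(d, 0)])


or_disease = odds_d_given_g(1) / odds_d_given_g(0)
or_genotype = odds_g_given_d(1) / odds_g_given_d(0)
print(or_disease, or_genotype)  # 57/7 57/7
```

Both expressions reduce to the same cross-product of the four cell counts, which is why they agree.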
![Page 15: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/15.jpg)
Example

| | #Allele1 (risk) | #Allele2 (wildtype) |
|---|---|---|
| Case | 10 | 90 |
| Control | 2 | 98 |

• Odds of risk allele in case = (10/100)/(90/100) = 1/9
• Odds of risk allele in control = (2/100)/(98/100) = 1/49
• Odds ratio of risk allele = (1/9)/(1/49) = 49/9
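The arithmetic in this example can be reproduced exactly with Python's `fractions` module. The helper name `odds_of_risk_allele` is illustrative only.

```python
from fractions import Fraction

# Allele counts from the example table: (risk, wildtype).
case = (10, 90)
control = (2, 98)


def odds_of_risk_allele(risk, wildtype):
    """Odds of the risk allele: Pr(risk) / Pr(wildtype) within one group."""
    total = risk + wildtype
    return Fraction(risk, total) / Fraction(wildtype, total)


odds_case = odds_of_risk_allele(*case)        # 1/9
odds_control = odds_of_risk_allele(*control)  # 1/49
odds_ratio = odds_case / odds_control         # 49/9
print(odds_case, odds_control, odds_ratio)
```

As a decimal, 49/9 is roughly 5.44: the odds of carrying the risk allele are about five times higher among cases than among controls.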
![Page 16: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/16.jpg)
What about significance?
• Okay, so the OR measures the risk. But is it significant? Perhaps it is due to chance.
• Let’s look at the chi-square test for measuring significance.
![Page 17: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/17.jpg)
Statistical test of association (P-values)
• P-value = probability, under the null hypothesis, of data at least as extreme as what was observed
• Example:
  – Suppose we are given a series of coin tosses
  – We feel that a biased coin produced the tosses
  – We can ask the following question: how probable are tosses at least this extreme if the coin is fair?
  – If this probability is very small then a fair coin is a poor explanation of the observed tosses.
  – In this example the null hypothesis is the fair coin and the alternative hypothesis is the biased coin
![Page 18: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/18.jpg)
Binomial distribution

• Bernoulli random variable:
  – Two outcomes: success or failure
  – Example: coin toss
• Binomial random variable:
  – Number of successes in a series of independent Bernoulli trials
• Example:
  – Probability of heads = 0.5
  – Given four coin tosses what is the probability of three heads?
  – Possible outcomes: HHHT, HHTH, HTHH, THHH
  – Each outcome has probability = 0.5^4
  – Total probability = 4 * 0.5^4 = 0.25
![Page 19: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/19.jpg)
Binomial distribution
• Bernoulli trial probability of success=p, probability of failure = 1-p
• Given n independent Bernoulli trials what is the probability of k successes?

$$\Pr(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$

• Binomial applet: http://www.stat.tamu.edu/~west/applets/binomialdemo.html
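The probability of k successes in n independent Bernoulli trials can be evaluated directly. This is a small sketch using the standard library; `binomial_pmf` is a name chosen here for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Three heads in four tosses of a fair coin: 4 * 0.5^4.
print(binomial_pmf(3, 4, 0.5))  # 0.25
```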
![Page 20: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/20.jpg)
Hypothesis testing under Binomial hypothesis
• Null hypothesis: fair coin (probability of heads = probability of tails = 0.5)
• Data: HHHHTHTHHHHHHHTHTHTH (15 heads in 20 tosses)
• P-value under null hypothesis = probability that #heads >= 15
• This probability is 0.021
• Since it is below 0.05 we can reject the null hypothesis
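The p-value quoted for this data can be checked by summing the binomial tail. The sketch assumes a one-sided test (at least as many heads as observed), as stated on the slide.

```python
from math import comb

tosses = "HHHHTHTHHHHHHHTHTHTH"
n = len(tosses)
heads = tosses.count("H")

# One-sided p-value: probability of at least `heads` heads under a fair coin.
p_value = sum(comb(n, k) for k in range(heads, n + 1)) / 2**n
print(n, heads, round(p_value, 3))  # 20 15 0.021
```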
![Page 21: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/21.jpg)
Chi-square statistic
• Define four random variables Xi each of which is binomially distributed Xi ~ B(n, pi) where n=c1+c2+c3+c4 is the total number of observations in the table (alleles counted across all subjects) and pi is the probability of success of Xi.
• Each variable Xi is the count in one cell of the table: the number of risk or wildtype alleles observed among cases or among controls.
• The expected value E(Xi) = npi since each Xi is binomial.

| | #Allele1 (risk) | #Allele2 (wildtype) |
|---|---|---|
| Case | c1 (X1) | c2 (X2) |
| Control | c3 (X3) | c4 (X4) |
![Page 22: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/22.jpg)
Chi-square statistic

Define the statistic:

$$\chi^2 = \sum_{i=1}^{n} \frac{(c_i - e_i)^2}{e_i}$$

where
ci = observed frequency for ith outcome
ei = expected frequency for ith outcome
n = total number of outcomes (the number of categories being compared, not the sample size)

The probability distribution of this statistic is given by the chi-square distribution with n-1 degrees of freedom. Proof can be found at http://ocw.mit.edu/NR/rdonlyres/Mathematics/18-443Fall2003/4226DF27-A1D0-4BB8-939A-B2A4167B5480/0/lec23.pdf

Great. But how do we use this to get a SNP p-value?
![Page 23: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/23.jpg)
Null hypothesis for case control contingency table
• We have two random variables:
  – D: disease status
  – G: allele type
• Null hypothesis: the two variables are independent of each other (unrelated)
• Under independence
  – P(D,G) = P(D)P(G)
  – P(D=case) = (c1+c2)/n
  – P(G=risk) = (c1+c3)/n
• Expected values
  – E(X1) = P(D=case)P(G=risk)n
• We can calculate the chi-square statistic for a given SNP and, from it, a p-value: the probability of a statistic at least this large if the SNP were independent of disease status.
• SNPs with very small p-values deviate significantly from the independence assumption and are therefore considered important.

| | #Allele1 (risk) | #Allele2 (wildtype) |
|---|---|---|
| Case | c1 | c2 |
| Control | c3 | c4 |
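The steps above can be sketched for a single 2x2 allele table, here applied to the earlier case-control counts (10, 90, 2, 98). Two points are additions to the slides: the function name `chi_square_2x2` is hypothetical, and the p-value uses one degree of freedom, the standard value (2-1)(2-1) for a 2x2 contingency table whose margins are estimated from the data. The sketch assumes all expected counts are positive.

```python
from math import erfc, sqrt


def chi_square_2x2(c1, c2, c3, c4):
    """Pearson chi-square statistic and p-value for a 2x2 allele table.

    c1, c2: risk and wildtype allele counts in cases;
    c3, c4: risk and wildtype allele counts in controls.
    """
    n = c1 + c2 + c3 + c4
    p_case = (c1 + c2) / n
    p_risk = (c1 + c3) / n
    observed = [c1, c2, c3, c4]
    # Expected counts under independence: n * P(D) * P(G) for each cell.
    expected = [
        n * p_case * p_risk,
        n * p_case * (1 - p_risk),
        n * (1 - p_case) * p_risk,
        n * (1 - p_case) * (1 - p_risk),
    ]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom the chi-square tail equals erfc(sqrt(stat / 2)).
    p_value = erfc(sqrt(stat / 2))
    return expected, stat, p_value


expected, stat, p_value = chi_square_2x2(10, 90, 2, 98)
print(expected, round(stat, 3), round(p_value, 4))
```

For these counts the expected values are 6, 94, 6 and 94, and the statistic is about 5.67.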
![Page 24: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/24.jpg)
Chi-square statistic exercise
| | #Allele1 | #Allele2 |
|---|---|---|
| Case | 15 | 35 |
| Control | 2 | 48 |

• Compute expected values and chi-square statistic
• Compute chi-square p-value by referring to chi-square distribution
![Page 25: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/25.jpg)
Logistic regression

• The odds ratio estimated from the contingency table directly has a skewed sampling distribution.
• A better (discriminative) approach is to model the log likelihood ratio log(Pr(G|D=case)/Pr(G|D=control)) as a linear function. In other words:

$$\log\left(\frac{\Pr(G \mid D=\text{case})}{\Pr(G \mid D=\text{control})}\right) = w^T G + w_0$$

• Why:
  – Log likelihood ratio is a powerful statistic
  – Modeling as a linear function yields a simple algorithm to estimate parameters
• G is number of copies of the risk allele
• With some manipulation this becomes

$$\Pr(D=\text{case} \mid G) = \frac{1}{1 + e^{-(w^T G + w_0)}}$$
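The "manipulation" can be sketched with Bayes rule. The symbol $\pi = \Pr(D=\text{case})$, the prior probability of being a case, is introduced only for this sketch and does not appear in the slides.

$$
\begin{aligned}
\Pr(D=\text{case} \mid G)
&= \frac{\Pr(G \mid D=\text{case})\,\pi}{\Pr(G \mid D=\text{case})\,\pi + \Pr(G \mid D=\text{control})\,(1-\pi)} \\
&= \frac{1}{1 + \dfrac{\Pr(G \mid D=\text{control})}{\Pr(G \mid D=\text{case})} \cdot \dfrac{1-\pi}{\pi}} \\
&= \frac{1}{1 + e^{-\left(w^T G + w_0 + \log\frac{\pi}{1-\pi}\right)}}
\end{aligned}
$$

The constant $\log\frac{\pi}{1-\pi}$ does not depend on G, so it can be absorbed into the intercept. That is presumably why the slides write the result with the same symbol $w_0$; the weight $w$ on G is unchanged.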
![Page 26: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/26.jpg)
How do we get the odds ratio from logistic regression? (I)
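• Using Bayes rule we have

$$\log\left(\frac{\Pr(G \mid D=\text{case})}{\Pr(G \mid D=\text{control})}\right) = w^T G + w_0$$

By exponentiating both sides we get

$$\frac{\Pr(G \mid D=\text{case})}{\Pr(G \mid D=\text{control})} = e^{w^T G + w_0}$$

And by taking the ratio with G=1 and G=0 we get

$$\frac{\Pr(G=1 \mid D=\text{case}) \,/\, \Pr(G=1 \mid D=\text{control})}{\Pr(G=0 \mid D=\text{case}) \,/\, \Pr(G=0 \mid D=\text{control})} = \frac{e^{w^T 1 + w_0}}{e^{w^T 0 + w_0}} = e^{w}$$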
![Page 27: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/27.jpg)
How do we get the odds ratio from logistic regression? (II)
Continued from previous slide: by rearranging the terms in the numerator and denominator we get

$$\frac{\Pr(G=1 \mid D=\text{case}) \,/\, \Pr(G=1 \mid D=\text{control})}{\Pr(G=0 \mid D=\text{case}) \,/\, \Pr(G=0 \mid D=\text{control})} = \frac{\Pr(G=1 \mid D=\text{case}) \,/\, \Pr(G=0 \mid D=\text{case})}{\Pr(G=1 \mid D=\text{control}) \,/\, \Pr(G=0 \mid D=\text{control})}$$

By symmetry of odds ratio this is

$$\frac{\Pr(G=1 \mid D=\text{case}) \,/\, \Pr(G=0 \mid D=\text{case})}{\Pr(G=1 \mid D=\text{control}) \,/\, \Pr(G=0 \mid D=\text{control})} = \frac{\Pr(D=\text{case} \mid G=1) \,/\, \Pr(D=\text{control} \mid G=1)}{\Pr(D=\text{case} \mid G=0) \,/\, \Pr(D=\text{control} \mid G=0)} = OR \text{ (odds ratio)}$$

Since the original ratio (see previous slide) is equal to $e^{w}$ and is equal to the odds ratio we conclude that the odds ratio is given by this value.
![Page 28: Genome-wide association studies](https://reader036.vdocument.in/reader036/viewer/2022062305/56815cb0550346895dcaae4a/html5/thumbnails/28.jpg)
How to find w and w0?
• And so $e^{w}$ is our odds ratio. But how do we find w and w0?
  – We assume that one’s disease status D given their genotype G is a Bernoulli random variable.
  – Using this we form the sample likelihood
  – Differentiate the likelihood with respect to w and w0
  – Use gradient descent
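A minimal sketch of this procedure, under stated assumptions: each allele from the earlier example table (10, 90, 2, 98) is treated as one observation with G equal to 1 for the risk allele and 0 otherwise, the objective is the average negative Bernoulli log-likelihood, and the step size and iteration count are picked by hand for this toy data. It is meant to illustrate the idea, not to stand in for a production fitting routine.

```python
import math

# Allele-level data from the earlier example table, as (G, D, count):
# G = 1 for the risk allele, D = 1 for case.
cells = [(1, 1, 10), (0, 1, 90), (1, 0, 2), (0, 0, 98)]
total = sum(count for _, _, count in cells)


def sigmoid(z):
    """Logistic function: Pr(D = case | G) for linear score z."""
    return 1.0 / (1.0 + math.exp(-z))


w, w0 = 0.0, 0.0
learning_rate = 1.0  # step size chosen by hand for this toy data
for _ in range(20000):
    grad_w = grad_w0 = 0.0
    for g, d, count in cells:
        # Gradient of the average negative Bernoulli log-likelihood.
        error = sigmoid(w * g + w0) - d
        grad_w += count * error * g / total
        grad_w0 += count * error / total
    w -= learning_rate * grad_w
    w0 -= learning_rate * grad_w0

print(round(math.exp(w), 3))  # close to 49/9, about 5.444
```

With a single binary predictor the fitted $e^{w}$ agrees with the odds ratio computed directly from the contingency table, which makes this a convenient sanity check.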