genome-wide association studies

Genome-wide association studies

Usman Roshan

SNP

• Single nucleotide polymorphism

• Specific position and specific chromosome


SNP genotype

Suppose this is the DNA on chromosome 1 starting from position 1.

There is a SNP C/G on position 5, C/T on position 14, and G/T on position 21. This person is heterozygous in the first SNP and homozygous in the other two.

F: AACACAATTAGTACAATTATGAC
M: AACAGAATTAGTACAATTATGAC

SNP genotype representation

The example

F: AACACAATTAGTACAATTATGAC
M: AACAGAATTAGTACAATTATGAC

is represented as

CG CC GG …
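The representation above can be reproduced with a short sketch. The helper name `genotypes_at` and the list `snp_positions` are hypothetical, introduced only for illustration; positions are 1-based as on the slide, and the two alleles are written in alphabetical order.

```python
# Hypothetical helper: read the unordered genotype at 1-based SNP positions
# from the two chromosome copies (labelled F and M on the slide).
def genotypes_at(copy_f, copy_m, positions):
    result = []
    for pos in positions:
        # Sort the two alleles so that C/G is always written "CG", never "GC".
        alleles = sorted([copy_f[pos - 1], copy_m[pos - 1]])
        result.append("".join(alleles))
    return result

F = "AACACAATTAGTACAATTATGAC"
M = "AACAGAATTAGTACAATTATGAC"
snp_positions = [5, 14, 21]  # SNP positions given in the example

genotype = genotypes_at(F, M, snp_positions)
print(" ".join(genotype))  # CG CC GG
```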

SNP genotype

• For several individuals

      A/T  C/T  G/T  …
H0:   AA   TT   GG   …
H1:   AT   CC   GT   …
H2:   AA   CT   GT   …
...

SNP genotype encoding

• If SNP is A/B (alphabetically ordered) then count number of times we see B.

• Previous example becomes

      A/T  C/T  G/T  …         A/T  C/T  G/T  …
H0:   AA   TT   GG   …          0    2    0   …
H1:   AT   CC   GT   …    =>    1    0    1   …
H2:   AA   CT   GT   …          0    1    1   …

Now we have data in numerical format
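A minimal sketch of this encoding, assuming each SNP is given as a string such as "A/T" and each genotype as a two-letter string. The function name `encode_genotype` and the dictionary layout are illustrative choices, not part of the slides.

```python
def encode_genotype(genotype, snp):
    """Count copies of the alphabetically second allele B of an A/B SNP."""
    _, allele_b = sorted(snp.split("/"))
    return genotype.count(allele_b)

snps = ["A/T", "C/T", "G/T"]
individuals = {
    "H0": ["AA", "TT", "GG"],
    "H1": ["AT", "CC", "GT"],
    "H2": ["AA", "CT", "GT"],
}

# Encode every genotype of every individual as 0, 1 or 2.
encoded = {
    name: [encode_genotype(g, s) for g, s in zip(genotypes, snps)]
    for name, genotypes in individuals.items()
}
for name, row in encoded.items():
    print(name, row)
```

Each row comes out as the count vector shown in the table, for example H1 becomes 1 0 1.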

Genome wide association studies (GWAS)

• Aim to identify which regions (or SNPs) in the genome are associated with a disease or other phenotype.

• Design:
  – Identify population structure
  – Select case subjects (those with disease)
  – Select control subjects (healthy)
  – Genotype a million SNPs for each subject
  – Determine which SNP is associated.

Example GWAS

            A/T  C/G  A/G  …
Case 1      AA   CC   AA
Case 2      AT   CG   AA
Case 3      AA   CG   AA
Control 1   TT   GG   GG
Control 2   TT   CC   GG
Control 3   TA   CG   GG

Encoded data

        A/T  C/G  A/G         A/T  C/G  A/G
Case1   AA   CC   AA           0    0    0
Case2   AT   CG   AA           1    1    0
Case3   AA   CG   AA     =>    0    1    0
Con1    TT   GG   GG           2    2    2
Con2    TT   CC   GG           2    0    2
Con3    TA   CG   GG           1    1    2

Ranking SNPs

        SNP1 SNP2 SNP3        SNP1 SNP2 SNP3
        A/T  C/G  A/G         A/T  C/G  A/G
Case1   AA   CC   AA           0    0    0
Case2   AT   CG   AA           1    1    0
Case3   AA   CG   AA     =>    0    1    0
Con1    TT   GG   GG           2    2    2
Con2    TT   CC   GG           2    0    2
Con3    TA   CG   GG           1    1    2

A good ranking strategy would produce SNP3, SNP1, SNP2
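To see why this ordering is reasonable, here is a toy sketch that ranks the three SNPs by the absolute difference in mean encoded genotype between cases and controls. This score is an assumption chosen only for illustration; it is not the chi-square test that the slides use for ranking.

```python
# Encoded genotypes from the example: rows are subjects, columns are SNP1..SNP3.
cases = [[0, 0, 0], [1, 1, 0], [0, 1, 0]]
controls = [[2, 2, 2], [2, 0, 2], [1, 1, 2]]

def column_mean(rows, j):
    """Mean encoded genotype of SNP column j over the given subjects."""
    return sum(row[j] for row in rows) / len(rows)

# Illustrative score (an assumption, not the slides' method):
# how far apart the case and control means are for each SNP.
scores = {
    f"SNP{j + 1}": abs(column_mean(cases, j) - column_mean(controls, j))
    for j in range(3)
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['SNP3', 'SNP1', 'SNP2']
```

SNP3 separates cases from controls perfectly in this toy data, SNP1 mostly, and SNP2 hardly at all.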

Chi-square test

• Gold standard is the univariate non-parametric chi-square test with two degrees of freedom.

• Search for SNPs that deviate from the independence assumption.

• Rank SNPs by p-values

Case-control example

• Study of 100 people:
  – Case: 50 subjects with cancer
  – Control: 50 subjects without cancer

• Count number of alleles and form a 2x2 contingency table

• Relative risk: RR = Pr(disease | one copy of risk allele) / Pr(disease | zero copies of risk allele) (Jewell ‘03)

• Because cases and controls are sampled in fixed numbers rather than at random from the population, we cannot estimate Pr(disease | genotype), and hence the relative risk, from a case-control study

• But we can estimate the odds ratio

            #Allele1   #Allele2
Case           10         90
Control         2         98

Symmetry in odds ratio

• The odds ratio is symmetric in disease and genotype: OR = Odds(D|G=1)/Odds(D|G=0) = Odds(G|D=1)/Odds(G|D=0)

• Great! Because we can estimate P(G|D) from a case-control study. We can now use the OR as an estimate of one’s risk of disease.

Example

• Odds of risk allele in case = (10/100)/(90/100) = 1/9

• Odds of risk allele in control = (2/100)/(98/100) = 1/49

• Odds ratio of risk allele = 49/9

            #Allele1 (risk)   #Allele2 (wildtype)
Case              10                  90
Control            2                  98
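The arithmetic in this example can be checked with exact fractions. The function name `odds_ratio` and its argument names are hypothetical; the counts are those of the table above.

```python
from fractions import Fraction

def odds_ratio(case_risk, case_wild, control_risk, control_wild):
    """Odds of the risk allele among cases divided by its odds among controls."""
    odds_case = Fraction(case_risk, case_wild)           # (10/100)/(90/100) = 1/9
    odds_control = Fraction(control_risk, control_wild)  # (2/100)/(98/100) = 1/49
    return odds_case / odds_control

result = odds_ratio(10, 90, 2, 98)
print(result)         # 49/9
print(float(result))  # roughly 5.44
```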

What about significance?

• Okay, so the OR measures the risk. But is it significant? Perhaps it is due to chance.

• Let’s look at the chi-square test for measuring significance.

Statistical test of association (P-values)

• P-value = probability, under the null hypothesis, of data at least as extreme as the data observed

• Example:
  – Suppose we are given a series of coin tosses
  – We feel that a biased coin produced the tosses
  – We can ask the following question: if the coin were fair, what is the probability of tosses at least as extreme as the ones observed?
  – If this probability is very small then we can say there is a small chance that a fair coin would produce the observed tosses.
  – In this example the null hypothesis is the fair coin and the alternative hypothesis is the biased coin

Binomial distribution

• Bernoulli random variable:
  – Two outcomes: success or failure
  – Example: coin toss

• Binomial random variable:
  – Number of successes in a series of independent Bernoulli trials

• Example:
  – Probability of heads = 0.5
  – Given four coin tosses what is the probability of three heads?
  – Possible outcomes: HHHT, HHTH, HTHH, THHH
  – Each outcome has probability = 0.5^4
  – Total probability = 4 * 0.5^4 = 0.25

Binomial distribution

• Bernoulli trial probability of success=p, probability of failure = 1-p

• Given n independent Bernoulli trials what is the probability of k successes?

• Binomial applet: http://www.stat.tamu.edu/~west/applets/binomialdemo.html

$$P(k \text{ successes in } n \text{ trials}) = \binom{n}{k} p^k (1-p)^{n-k}$$
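As a quick check of the formula, here is a small sketch that evaluates it with Python's standard library. The function name `binomial_pmf` is an illustrative choice.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli(p) trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Four fair coin tosses, three heads: 4 * 0.5^4.
three_heads = binomial_pmf(3, 4, 0.5)
print(three_heads)  # 0.25
```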

Hypothesis testing under Binomial hypothesis

• Null hypothesis: fair coin (probability of heads = probability of tails = 0.5)

• Data: HHHHTHTHHHHHHHTHTHTH (15 heads in 20 tosses)

• P-value under null hypothesis = probability that #heads >= 15

• This probability is 0.021

• Since it is below 0.05 we can reject the null hypothesis
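The p-value quoted here can be recomputed by summing the upper tail of the binomial distribution. This is a sketch under the slide's one-sided setup (#heads >= observed); the function name `upper_tail_p_value` is hypothetical.

```python
import math

def upper_tail_p_value(heads, tosses, p=0.5):
    """P(#heads >= observed) for a coin with heads probability p."""
    return sum(
        math.comb(tosses, k) * p**k * (1 - p) ** (tosses - k)
        for k in range(heads, tosses + 1)
    )

data = "HHHHTHTHHHHHHHTHTHTH"
heads = data.count("H")  # 15
tosses = len(data)       # 20

p_value = upper_tail_p_value(heads, tosses)
print(round(p_value, 3))  # 0.021
```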

Chi-square statistic

• Define four random variables Xi each of which is binomially distributed Xi ~ B(n, pi) where n=c1+c2+c3+c4 is the total count in the table and pi is the probability of success of Xi.

• Each variable Xi is the count in one cell of the table, that is, one combination of disease status (case or control) and allele (risk or wildtype).

• The expected value E(Xi) = npi since each Xi is binomial.

            #Allele1 (risk)   #Allele2 (wildtype)
Case            c1 (X1)             c2 (X2)
Control         c3 (X3)             c4 (X4)

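Chi-square statistic

Define the statistic:

$$\chi^2 = \sum_{i=1}^{n} \frac{(c_i - e_i)^2}{e_i}$$

where
ci = observed frequency for ith outcome
ei = expected frequency for ith outcome
n = total number of outcomes (here the four cells of the table, not the number of subjects)

The probability distribution of this statistic is given by the chi-square distribution with n-1 degrees of freedom when the expected frequencies are fully specified in advance. For a 2x2 contingency table whose expected frequencies are estimated from the row and column totals, the degrees of freedom are (2-1)(2-1) = 1. Proof can be found at http://ocw.mit.edu/NR/rdonlyres/Mathematics/18-443Fall2003/4226DF27-A1D0-4BB8-939A-B2A4167B5480/0/lec23.pdf

Great. But how do we use this to get a SNP p-value?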

Null hypothesis for case control contingency table

• We have two random variables:
  – D: disease status
  – G: allele type.

• Null hypothesis: the two variables are independent of each other (unrelated)

• Under independence
  – P(D,G) = P(D)P(G)
  – P(D=case) = (c1+c2)/n
  – P(G=risk) = (c1+c3)/n

• Expected values
  – E(X1) = P(D=case)P(G=risk)n = (c1+c2)(c1+c3)/n

• We can calculate the chi-square statistic for a given SNP and its p-value, the probability of a statistic at least this large if the SNP is independent of disease status.

• SNPs with very small p-values deviate significantly from the independence assumption and are therefore considered important.

            #Allele1 (risk)   #Allele2 (wildtype)
Case              c1                  c2
Control           c3                  c4

Chi-square statistic exercise

            #Allele1   #Allele2
Case           15         35
Control         2         48

• Compute expected values and chi-square statistic

• Compute chi-square p-value by referring to chi-square distribution
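One way to check an answer to this exercise is the sketch below. It assumes the independence null hypothesis from the previous slide and one degree of freedom for a 2x2 table, for which the chi-square tail probability equals erfc(sqrt(x/2)). The function name `chi_square_2x2` is hypothetical.

```python
import math

def chi_square_2x2(table):
    """Expected counts, chi-square statistic and p-value for a 2x2 table.

    table = [[c1, c2], [c3, c4]] with rows case/control and
    columns risk/wildtype allele counts.
    """
    (c1, c2), (c3, c4) = table
    n = c1 + c2 + c3 + c4
    row_totals = [c1 + c2, c3 + c4]
    col_totals = [c1 + c3, c2 + c4]
    # Under independence, expected count = row total * column total / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (table[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(2)
        for j in range(2)
    )
    # Tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return expected, statistic, p_value

expected, statistic, p_value = chi_square_2x2([[15, 35], [2, 48]])
print(expected)   # [[8.5, 41.5], [8.5, 41.5]]
print(statistic)  # about 11.98
print(p_value)    # well below 0.05
```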

Logistic regression

• The odds ratio estimated from the contingency table directly has a skewed sampling distribution.

• A better (discriminative) approach is to model the log likelihood ratio log(Pr(G|D=case)/Pr(G|D=control)) as a linear function. In other words:

$$\log\left(\frac{\Pr(G \mid D = \text{case})}{\Pr(G \mid D = \text{control})}\right) = w^T G + w_0$$

• Why:
  – Log likelihood ratio is a powerful statistic
  – Modeling as a linear function yields a simple algorithm to estimate parameters

• G is number of copies of the risk allele

• With some manipulation (Bayes' rule, with the log prior odds absorbed into w0) this becomes

$$\Pr(D = \text{case} \mid G) = \frac{1}{1 + e^{-(w^T G + w_0)}}$$

How to find w and w0?

• And so e^w is our odds ratio. But how do we find w and w0?
  – We assume that one’s disease status D given their genotype G is a Bernoulli random variable.
  – Using this we form the sample likelihood
  – Differentiate the likelihood with respect to w and w0
  – Use gradient descent
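The steps above can be sketched on the earlier 2x2 example. This is a minimal illustration under stated assumptions, not a definitive implementation: each allele is treated as one observation with G = 1 for a risk allele and G = 0 for wildtype (rather than G counting copies per subject), and the step size and iteration count are arbitrary choices. With a single binary feature, the fitted e^w should approach the table's odds ratio of 49/9.

```python
import math

# (G, D, count): G = 1 for a risk allele, D = 1 for a case.
# Counts come from the earlier example table (10, 90, 2, 98).
cells = [(1, 1, 10), (0, 1, 90), (1, 0, 2), (0, 0, 98)]
total = sum(count for _, _, count in cells)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, w0 = 0.0, 0.0
learning_rate = 1.0  # assumed step size
for _ in range(20000):  # assumed iteration count
    grad_w = grad_w0 = 0.0
    for g, d, count in cells:
        # Gradient of the mean negative Bernoulli log-likelihood:
        # (predicted probability - observed label) times the input.
        error = sigmoid(w * g + w0) - d
        grad_w += count * error * g / total
        grad_w0 += count * error / total
    w -= learning_rate * grad_w
    w0 -= learning_rate * grad_w0

print(math.exp(w))  # should be close to 49/9, about 5.44
```

Grouping identical observations into four weighted cells keeps the loop small; iterating over all 200 alleles individually would give the same gradients.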
