Comp. Genomics Recitation 10, 4/7/09: Differential expression detection

Comp. Genomics

Recitation 10
4/7/09
Differential expression detection

Outline

• Clustering vs. differential expression
• Fold change
• T-test
• Multiple testing
• FDR/SAM
• Mann-Whitney
• Examples

Microarray preliminaries

• General input: a matrix of probes (sequences) and intensities
• We assume the hard work is over:
  • Probes are assigned to genes
  • The data is properly (?) normalized
  • We have an expression matrix
    • Rows correspond to genes
    • Columns correspond to conditions

Microarray analysis

• Common scenarios:
  • We tested the behavior of genes across several time points
  • We tested a large number of different conditions
  • Clustering is the solution
• We compared a small number of conditions (2) and have multiple replicates for each condition
  • E.g., we measured blood expression in 10 sick and 10 healthy individuals
  • Differential expression analysis

Identification of differential genes

• The most basic experimental design: comparison between 2 conditions – ‘treatment’ vs. control
  • More complex: sick/treatment/control
• The goal: identify genes that are differentially expressed in the examined conditions
• Number of replicates is usually low (n=2-4)
  • Statistics are important

Slides: Rani Elkon

Approaches for identification of differential genes

1. Fold Change
2. T-test
3. SAM

1. Fold Change

• Consider genes whose mean expression level changed by at least 1.75- to 2-fold as differential genes
• Pros:
  • Very simple!
• Cons:
  • Usually no estimate of the false positive rate is provided
  • Biased toward genes with low expression levels
  • Ignores the variability of gene levels over replicates

Fold Change limit – Biased to low expression levels

Determine ‘floor’ cut-off and set all expression levels below it to this floor level
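
The floor idea can be sketched in a few lines. This is only an illustration, assuming raw (non-log) positive intensities; the function names, the floor value and the 2-fold threshold are hypothetical choices, not part of the slides.

```python
import math

def floored_fold_change(control, treatment, floor=1.0):
    """Ratio of mean treatment to mean control after clamping to a floor."""
    # Raise every measurement below the floor up to the floor level.
    c = [max(x, floor) for x in control]
    t = [max(x, floor) for x in treatment]
    return (sum(t) / len(t)) / (sum(c) / len(c))

def is_differential(control, treatment, floor=1.0, min_fold=2.0):
    """Flag a gene whose mean changed by at least min_fold, up or down."""
    fc = floored_fold_change(control, treatment, floor)
    return abs(math.log2(fc)) >= math.log2(min_fold)

# A weakly expressed gene: 10 -> 30 is a 3-fold change on raw values,
# but disappears once both conditions are clamped to a floor of 50.
print(floored_fold_change([10, 10], [30, 30], floor=1.0))
print(floored_fold_change([10, 10], [30, 30], floor=50.0))
```

With the floor in place, small absolute differences near the noise level no longer produce large ratios, which is the bias the slide refers to.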

Fold Change limit – ignores variability over replicates

• We need a score that ‘punishes’ genes with high variability over replicates

Approaches for identification of differential genes

1. Fold Change
2. T-test
3. SAM

2. T-test

• Compute a t-score for each gene

m_c, m_t – mean levels in Control and Treatment

S_c², S_t² – variance estimates in Control and Treatment

n_c, n_t – number of replicates in Control and Treatment
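
The formula image itself is not in the transcript. Given the symbols defined above, it is presumably the usual unequal-variance form, t = (m_t − m_c) / sqrt(S_c²/n_c + S_t²/n_t); that form is an assumption here. A minimal sketch for one gene, with a hypothetical function name:

```python
import math
from statistics import mean, variance

def t_score(control, treatment):
    """Unequal-variance (Welch-style) t-score for a single gene."""
    m_c, m_t = mean(control), mean(treatment)
    # statistics.variance is the sample variance (divides by n - 1).
    s2_c, s2_t = variance(control), variance(treatment)
    n_c, n_t = len(control), len(treatment)
    # Raises ZeroDivisionError if both groups have zero variance.
    return (m_t - m_c) / math.sqrt(s2_c / n_c + s2_t / n_t)

print(t_score([1, 2, 3], [4, 5, 6]))
```

Unlike fold change, the denominator grows with the spread of the replicates, so noisy genes get a lower score.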

T-test

• The t-score is good because it is the result of a well-known statistical hypothesis test
• If we assume the samples are normally distributed (unknown variance) and compare two hypotheses:
  • H0 – All the measurements come from the same distribution
  • H1 – The control and treatment measurements come from different normal distributions
• In this case a p-value can be derived for every t-score

T-test

• Set cut-off for p-value (α=0.01) and consider all genes with p-value < α as differential genes

Multiple Testing

• The p-value p_g associated with the t-score t_g is the probability of obtaining by chance a t-score at least as extreme as t_g.

• Multiplicity problem: thousands of genes are tested simultaneously (all the genes on the array!)

• Simple example:
  • 10,000 genes on a chip
  • Not a single one is differentially expressed (everything is random)
  • α=0.01
  • 10,000 × 0.01 = 100 genes are expected to have a p-value < 0.01 just by chance.
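
The arithmetic in this example can be checked with a small simulation. It assumes that under H0 each gene's p-value is uniform on [0, 1] and that genes are independent; the function name and the seed are arbitrary.

```python
import random

def count_chance_hits(n_genes=10_000, alpha=0.01, seed=0):
    """Count null genes whose p-value falls below alpha purely by chance."""
    rng = random.Random(seed)
    # Under H0 a p-value is a uniform draw on [0, 1].
    return sum(rng.random() < alpha for _ in range(n_genes))

expected = 10_000 * 0.01          # expected number of false positives
observed = count_chance_hits()    # one simulated chip with no true signal
print(expected, observed)
```

The observed count fluctuates around the expected 100 from chip to chip, even though no gene is truly differential.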

Multiple testing

• Individual p-values of, e.g., 0.01 no longer correspond to significant findings.

• Need to adjust for multiple testing when assessing the statistical significance of findings

• Actually this is a somewhat common problem in statistics

Multiple Testing

• Simple solution (Bonferroni): consider as differential genes only those with p-value < (α/N)
  • N: number of tests
  • α=0.01, N=10,000: cut-off=0.000001
• Ensures a very low probability of having any false positive genes (less than α)
• Advantage: very clean list of differential genes
• Limit: the list usually contains very few genes… an unacceptably high rate of false negatives
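
As a rough sketch of the Bonferroni rule, assuming the per-gene p-values are already available as a list (the function name is hypothetical):

```python
def bonferroni_select(p_values, alpha=0.01):
    """Indices of genes whose p-value is below alpha / N."""
    cutoff = alpha / len(p_values)   # N is the number of tests
    return [i for i, p in enumerate(p_values) if p < cutoff]

# 10,000 genes: one very strong signal, one moderate signal, the rest null-like.
p_values = [2e-7, 1e-4] + [0.5] * 9_998
print(bonferroni_select(p_values))
```

The gene with p = 1e-4 would pass an uncorrected α=0.01 cut-off easily but is dropped here, which illustrates the false-negative cost mentioned above.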

FDR correction (Benjamini & Hochberg)

• False Discovery Rate
  • In high-throughput studies, a certain proportion of false positives is tolerable
  • Control the expected proportion of false positives among the genes declared as differential (q=10%).
• Scheme:
  • Rank genes according to their p-values: p(1) < p(2) < … < p(N)
  • Consider as differential the top k genes, where k = max{i : p(i) < i·(q/N)}
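
The scheme can be sketched as follows. The code uses ≤ in the comparison, as in the usual statement of the Benjamini-Hochberg procedure; the slide writes a strict inequality, which differs only on exact ties. The function name and the toy p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, q=0.10):
    """Indices of genes declared differential at FDR level q (step-up rule)."""
    n = len(p_values)
    # Gene indices ordered by increasing p-value: p(1), p(2), ..., p(N).
    order = sorted(range(n), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / n:
            k = rank                 # keep the largest rank that passes
    return sorted(order[:k])         # the top k genes, as original indices

toy = [0.041, 0.001, 0.216, 0.008, 0.039, 0.042, 0.061, 0.074, 0.205, 0.212]
print(benjamini_hochberg(toy, q=0.10))
print(benjamini_hochberg(toy, q=0.05))
```

Note the step-up behavior at q=0.10: the third and fourth smallest p-values miss their own thresholds, yet they are still reported because the fifth one passes.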

Approaches for identification of differential genes

1. Fold Change
2. T-test
3. SAM

3. SAM (Tusher, Tibshirani & Chu)

• ‘Significance Analysis of Microarrays’
• Limit of the analytical FDR approach: it assumes that the tests are independent
• In the microarray context, the expression levels of some genes are highly correlated → unreliable FDR estimate
• SAM uses permutations to get an ‘empirical’ estimate for the FDR of the reported differential genes

SAM

• Scheme:
  • Compute for each gene a statistic that measures its relative expression difference in control vs. ‘treatment’ (t-score or a variant)
  • Rank the genes according to their ‘difference score’
  • Set a cut-off (d0) and consider all genes above it as differential (Nd)
  • Permute the condition labels, and count how many genes got a score above d0 (Np)
  • Repeat on many (all possible) permutations and count (Npj)
  • Estimate the FDR as the proportion: Average(Npj)/Nd
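
The scheme above might look like this in code. It is a simplified sketch, not the published SAM implementation: the difference score is an absolute t-like statistic with a small constant s0 added to the denominator (an assumption, in the spirit of "t-score or a variant"), and all relabelings are enumerated, including the real labeling. All names are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, variance

def diff_score(values, treated, s0=0.1):
    """Absolute t-like score; `treated` is the set of treatment column indices."""
    t = [values[j] for j in treated]
    c = [v for j, v in enumerate(values) if j not in treated]
    se = math.sqrt(variance(c) / len(c) + variance(t) / len(t))
    return abs(mean(t) - mean(c)) / (se + s0)

def permutation_fdr(matrix, treated, d0, s0=0.1):
    """Return (Nd, estimated FDR) for the cut-off d0."""
    n_cols, n_t = len(matrix[0]), len(treated)

    def count(cols):
        # Number of genes whose score exceeds d0 under this labeling.
        return sum(diff_score(row, cols, s0) > d0 for row in matrix)

    n_d = count(set(treated))
    # Every way of labeling n_t columns as 'treatment' (Npj for each one).
    n_p = [count(set(cols)) for cols in combinations(range(n_cols), n_t)]
    return n_d, (mean(n_p) / n_d if n_d else float("nan"))

# Rows are genes; columns 0-3 are control, columns 4-7 are treatment.
matrix = [
    [1.0, 1.1, 0.9, 1.0, 5.0, 5.1, 4.9, 5.0],
    [2.0, 2.2, 1.9, 2.1, 2.1, 1.9, 2.2, 2.0],
    [3.0, 3.4, 2.8, 3.1, 3.2, 2.9, 3.3, 3.0],
]
n_d, fdr = permutation_fdr(matrix, treated=[4, 5, 6, 7], d0=5.0)
print(n_d, fdr)
```

Because whole columns are relabeled together, correlations between genes are preserved in every permutation, which is the point of the empirical estimate.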

Permutation on condition labels

Gene   Expression values                   D score   Perm. 1   Perm. 2
G1     e11 e12 e13 e14 e15 e16 e17 e18     d1        d1p1      d1p2
G2     e21 e22 e23 e24 e25 e26 e27 e28     d2        d2p1      d2p2
G3     e31 e32 e33 e34 e35 e36 e37 e38     d3        d3p1      d3p2

SAM example

• Ionizing radiation response experiment

• After setting the threshold:
  • 46 genes found significant
• 36 permutations
  • 8.4 genes on average pass the threshold
• False discovery rate is 8.4/46 ≈ 18%

Mann-Whitney/Wilcoxon

• In general, the normality assumption of the t-test is problematic

• Nonparametric statistics are very useful in many bioinformatics-related problems

• Assume nothing about the distribution of the samples

• Less powerful (more false negatives, but fewer false positives)

Mann-Whitney/Wilcoxon

• MW/Wilcoxon test for two samples:
  • H0 – The medians of both distributions are the same
  • H1 – The medians of the distributions are different
• Assumes:
  • The two samples are independent
  • The observations can be ordered (ordinal)

Mann-Whitney/Wilcoxon

• Computes a U-score whose distribution is known under H0 (and can be approximated by a normal distribution in large samples)
  • Arrange all the observations into a single ranked series
  • Add up the ranks in sample 1. The sum of ranks in sample 2 follows by calculation, since the sum of all the ranks equals N(N+1)/2

• U-score:
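
The U-score formula is not in the transcript. The sketch below assumes the standard definition, U1 = R1 − n1(n1+1)/2 with R1 the rank sum of sample 1, U2 = n1·n2 − U1, and U = min(U1, U2); tied values receive the average of their ranks. The function name is hypothetical.

```python
def mann_whitney_u(sample1, sample2):
    """Mann-Whitney U statistic (the smaller of U1 and U2)."""
    # Pool the observations, remembering which sample each came from.
    pooled = sorted([(v, 1) for v in sample1] + [(v, 2) for v in sample2])
    n1, n2 = len(sample1), len(sample2)
    r1 = 0.0                          # rank sum of sample 1
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1                    # pooled[i:j] is a run of tied values
        avg_rank = (i + 1 + j) / 2    # mean of the 1-based ranks i+1 .. j
        r1 += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 1)
        i = j
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1                 # follows since U1 + U2 = n1 * n2
    return min(u1, u2)

print(mann_whitney_u([1, 2, 3], [4, 5, 6]))   # complete separation
print(mann_whitney_u([1, 4, 5], [2, 3, 6]))   # interleaved samples
```

A U of 0 means every value in one sample lies below every value in the other; values near n1·n2/2 indicate well-mixed samples.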