![Page 1: Statistical analysis of expression data: Normalization, differential expression and multiple testing Jelle Goeman](https://reader035.vdocument.in/reader035/viewer/2022062309/56649f045503460f94c17cdd/html5/thumbnails/1.jpg)
Statistical analysis of expression data:
Normalization, differential expression and multiple testing
Jelle Goeman
Outline
- Normalization
- Expression variation
- Modeling the log fold change
- Complex designs
- Shrinkage and empirical Bayes (limma)
- Multiple testing (False Discovery Rate)
Measuring expression
Platforms
Microarrays
RNAseq
Common:
- Need for normalization
- Batch effects
Why normalization
- Some experimental factors cannot be completely controlled
  - Amount of material
  - Amount of degradation
  - Print tip differences
  - Quality of hybridization
- Effects are systematic
  - Cause variation between samples and between batches
What is normalization?
Normalization = an attempt to get rid of unwanted systematic variation by statistical means
- Note 1: this will never completely succeed
- Note 2: this may do more harm than good

Much better, but often impossible: better control of the experimental conditions
How do normalization methods work?
General approach:
1. Assume: data from an ideal experiment would have characteristic A
   - E.g. mean expression is equal for each sample
   - Note: this is an assumption!
2. If the data do not have characteristic A, change the data such that the data now do have characteristic A
   - E.g. multiply each sample’s expression by a factor
Example: quantile normalization
- Assume:
  - “Most probes are not differentially expressed”
  - “As many probes are upregulated as downregulated”
- Reasonable consequence: the distribution of the expression values is identical for each sample
- Normalization: make the distribution of expression values identical for each sample
Quantile normalization in practice
- Choose a target distribution
  - Typically the average of the measured distributions
  - All samples will get this distribution after normalization
- Quantile normalization: replace the ith largest expression value in each sample by the ith largest value in the target distribution
- Consequence:
  - Distribution of expressions is the same between samples
  - Expressions for specific genes may differ
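
The rank-replacement step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `quantile_normalize` and the example values are hypothetical, the target distribution is assumed to be the average of the sorted samples, and ties are broken by position rather than averaged.

```python
def quantile_normalize(samples):
    """samples: list of equal-length lists, one list of expression values per sample."""
    n_genes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Target distribution: average of the i-th smallest values across samples.
    target = [sum(s[i] for s in sorted_samples) / len(samples) for i in range(n_genes)]
    normalized = []
    for s in samples:
        # Gene indices ordered from smallest to largest value in this sample.
        order = sorted(range(n_genes), key=lambda g: s[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = target[rank]  # value of rank i is replaced by the i-th target value
        normalized.append(out)
    return normalized

# Hypothetical expression values: three samples, four genes each.
samples = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.5, 2.0], [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(samples)
```

After this step every sample contains exactly the same set of values; only the assignment of values to genes differs between samples.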
Less radical forms of normalization
- Make the means per sample the same
- Make the medians the same
- Make the variances the same
- Loess curve smoothing

Same idea, but less change to the data
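
As a hedged illustration of the "make the medians the same" option, the sketch below shifts each sample so that all medians coincide. It assumes log-scale expression values, so the correction is an additive shift; the function name `median_center` and the numbers are made up for the example.

```python
from statistics import median

def median_center(samples):
    """Shift each sample (list of log expression values) so that all medians coincide."""
    medians = [median(s) for s in samples]
    target = sum(medians) / len(medians)  # common median after normalization
    return [[value - m + target for value in s] for s, m in zip(samples, medians)]

# Hypothetical log expression values: three samples, three genes each.
samples = [[5.0, 2.0, 3.0], [6.0, 3.5, 4.0], [1.0, 2.0, 9.0]]
centered = median_center(samples)
```

Unlike quantile normalization, this changes only the location of each sample; the shape of each sample's distribution is left untouched.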
Overnormalizing
- Normalizing can remove or reduce true biological differences
  - Example: global increase in expression
- Normalization can create differences that are not there
  - Example: almost global increase in expression
- Usually: normalization reduces unwanted variation
Batch effects
- Differences between batches are even stronger than differences between samples in the same batch
- Note: batch effects arise at several stages
- Normalization is not sufficient to remove batch effects
- Methods are available (ComBat) but not perfect
- Best: avoid batch effects if possible
Confounding by batch
Take care of batch effects in the experimental design
Problem: confounding of effect of interest by batch effects
Example: Golub data
Solution: balance or randomize
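
One way to combine balancing and randomizing is to shuffle the samples within each condition and then deal them out over the batches in turn. The sketch below is only an illustration under that assumption; `balanced_batches`, the sample identifiers and the batch count are hypothetical.

```python
import random
from collections import Counter

def balanced_batches(samples, n_batches, seed=0):
    """samples: list of (sample_id, condition). Returns {sample_id: batch index}."""
    rng = random.Random(seed)
    by_condition = {}
    for sample_id, condition in samples:
        by_condition.setdefault(condition, []).append(sample_id)
    assignment = {}
    for ids in by_condition.values():
        rng.shuffle(ids)  # randomize the order within each condition
        for position, sample_id in enumerate(ids):
            assignment[sample_id] = position % n_batches  # deal out over batches in turn
    return assignment

# Hypothetical design: 6 treated and 6 untreated samples, 2 batches.
samples = [(f"T{i}", "treated") for i in range(6)] + [(f"U{i}", "untreated") for i in range(6)]
assignment = balanced_batches(samples, n_batches=2)
counts = Counter((assignment[s], c) for s, c in samples)
```

Each batch then contains the same number of samples from each condition, so the effect of interest is not confounded with batch.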
Expression variation
Differential expression
- Two experimental conditions
  - Treated versus untreated
- Two distinct phenotypes
  - Tumor versus normal tissue
- Which genes can reliably be called differentially expressed?
- Also: continuous phenotypes
  - Which gene expressions are correlated with phenotype?
Variation in gene expression
- Technical variation
  - Variation due to measurement technique
  - Variability of measured expression from experiment to experiment on the same subject
- Biological variation
  - Variation between subjects/samples
  - Variability of “true” expression between different subjects
- Total variation
  - Sum of technical and biological variation
Reliable assessment
- Two samples always have different expression
  - Maybe even a high fold change
  - Due to random biological and technical variation
- Reliable assessment of differential expression:
  - Show: fold change found cannot be explained by random variation
Assessment of differential expression
Two interrelated aspects:
- Fold change: how large is the expression difference found?
- P-value: how sure are we that a true difference exists?
LIMMA: Linear models for gene expression
Modeling variation
How does gene expression depend on experimental conditions?
Can often be well modeled with linear models
Limma: linear models for microarray analysis
- Gordon Smyth, W. and E. Hall Institute, Australia
Multiplicative scale effects
Assumption: effects on gene expression work in a multiplicative way (“fold change”)
Example: treatment increases gene expression of gene MMP8 by a factor 2: “2-fold increase”
Treatment decreases gene expression of gene MMP8 by a factor 2: “2-fold decrease”
Multiplicative scale errors
Assumption: variation on gene expression works in a multiplicative way
A 2-fold increase by chance is just as likely as a 2-fold decrease by chance
When true expression is 4, measuring 8 is as likely as measuring 2
Working on the log scale
- When effects are multiplicative, log-transform!
  - Usual in microarray analysis: log to base 2
- Remember: log(ab) = log(a) + log(b)
  - 2-fold increase = +1 to log expression
  - 2-fold decrease = -1 to log expression
- Log scale makes multiplicative effects symmetric
  - ½ and 2 are not symmetric around 1 (= no change)
  - -1 and +1 are symmetric around 0 (= no change)
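
A small numeric check of the symmetry argument, using a true expression of 4 as an illustrative value; the variable names are chosen for this sketch only.

```python
import math

true_expression = 4.0
up, down = true_expression * 2, true_expression / 2  # 2-fold increase and decrease

# On the original scale the two changes are not symmetric around the true value.
diff_up = up - true_expression      # +4
diff_down = down - true_expression  # -2

# On the log2 scale they are symmetric: +1 and -1.
log_up = math.log2(up) - math.log2(true_expression)
log_down = math.log2(down) - math.log2(true_expression)
```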
A simple linear model
Example: treated and untreated samples
- Model separately for each gene
  - Log expression of gene 1: E1
  - E1 = a + b * Treatment + error
- a: intercept = average untreated log expression
- b: slope = treatment effect
Modeling all genes simultaneously
E1 = a1 + b1 * Treatment + error
E2 = a2 + b2 * Treatment + error
…
E20,000 = a20,000 + b20,000 * Treatment + error

Same model, but
- Separate intercept and slope for each gene
- And separate sd sigma1, sigma2, … of error
Estimates and standard errors
- Gene 1: estimates for a1, b1 and sigma1
- Estimate of treatment effect of gene 1
  - b1 is the estimated log fold change
  - Standard error s.e.(b1) depends on sigma1
- Regular t-test for H0: b1 = 0
  - T = b1/s.e.(b1)
  - Can be used to calculate p-values
- Just like regular regression, only 20,000 times
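
For a treated/untreated comparison the per-gene regression reduces to group means, which makes it easy to sketch. The code below is a hedged illustration: the function name `treatment_t_statistic` and the expression values are hypothetical, and it stops at the t-statistic (a p-value would follow from a t distribution with n - 2 degrees of freedom).

```python
import math
from statistics import mean

def treatment_t_statistic(untreated, treated):
    """Fit E = a + b * Treatment + error for one gene; return (a, b, se_b, t)."""
    n0, n1 = len(untreated), len(treated)
    a = mean(untreated)    # intercept: average untreated log expression
    b = mean(treated) - a  # slope: estimated log fold change
    # Residual sum of squares around the two fitted group means.
    rss = sum((y - a) ** 2 for y in untreated) + sum((y - a - b) ** 2 for y in treated)
    sigma = math.sqrt(rss / (n0 + n1 - 2))  # residual sd with n - 2 degrees of freedom
    se_b = sigma * math.sqrt(1 / n0 + 1 / n1)
    return a, b, se_b, b / se_b

# Hypothetical log2 expression values: (untreated samples, treated samples) per gene.
genes = {
    "gene1": ([7.0, 7.2, 6.8], [8.1, 7.9, 8.0]),
    "gene2": ([5.0, 6.0, 4.0], [5.5, 4.5, 6.5]),
}
results = {name: treatment_t_statistic(u, t) for name, (u, t) in genes.items()}
```

In this made-up example gene1 has the larger log fold change and the smaller error sd, so its t-statistic is much larger than that of gene2.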
Back to original scale
- Log scale regression coefficient b1
  - Average log fold change
- Back to a fold change: 2^b1
  - b1 = 1 becomes fold change 2
  - b1 = -1 becomes fold change 1/2
Confounders
Other effects may influence gene expression
- Example: batch effects
- Example: sex or age of patients

In a linear model we can adjust for such confounders
Flexibility of the linear model
Earlier: E1 = a1 + b1 * Treatment + error
Generalize: E1 = a1 + b1 * X + c1 * Y + d1 * Z + error
Add as many variables as you need.
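
To see what adjusting for a confounder does, consider a made-up, noise-free gene where treatment is partly confounded with batch. The sketch relies on a standard property of least squares: with a categorical confounder in the model, the treatment slope equals the slope obtained after centering both expression and treatment within each batch. All names and numbers here are hypothetical.

```python
from statistics import mean

# Hypothetical noise-free data: E = 5 + 1 * Treatment + 2 * Batch.
treatment = [0, 0, 0, 1, 0, 1, 1, 1]
batch = [0, 0, 0, 0, 1, 1, 1, 1]
expression = [5 + 1 * t + 2 * z for t, z in zip(treatment, batch)]

# Unadjusted estimate: mean treated expression minus mean untreated expression.
naive = (mean(e for e, t in zip(expression, treatment) if t == 1)
         - mean(e for e, t in zip(expression, treatment) if t == 0))

def center_within(values, groups):
    """Subtract each group's mean from the values in that group."""
    group_means = {g: mean(v for v, h in zip(values, groups) if h == g) for g in set(groups)}
    return [v - group_means[g] for v, g in zip(values, groups)]

# Adjusted estimate: slope of batch-centered expression on batch-centered treatment.
e_c = center_within(expression, batch)
t_c = center_within(treatment, batch)
adjusted = sum(t * e for t, e in zip(t_c, e_c)) / sum(t * t for t in t_c)
```

The unadjusted difference mixes the treatment effect with the batch effect, while the adjusted slope recovers the treatment effect of 1 that was used to build the data.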
Variance shrinkage
Empirical Bayes
- So far: each gene on its own
  - 20,000 unrelated models
- Limma: exchange information between genes
  - “Borrowing strength”
  - By empirical Bayes arguments
Estimating variance
- For each gene a variance is estimated
- Small sample size: variance estimate is unreliable
  - Too small for some genes
  - Too large for others
- Variance estimated too small: false positives
- Variance estimated too large: low power
Large and small estimated variance
- Gene with low variance estimate
  - Likely to have low true variance
  - But also: likely to have underestimated variance
- Gene with high variance estimate
  - Likely to have high true variance
  - But also: likely to have overestimated variance
- Limma’s idea: use information from other genes to assess whether variance is over/underestimated
True and estimated variance
Variance model
- Limma has a gene variance model
  - All genes’ variances are drawn at random from an inverse gamma distribution
- Based on this model:
  - Large variances are shrunk downwards
  - Small variances are shrunk upwards
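
The shrinkage can be sketched as a weighted average of a gene's own variance estimate and a prior variance, with degrees of freedom as weights, which is the form the moderated variance takes under an inverse gamma prior. This is a simplified illustration: in limma the prior variance and prior degrees of freedom are estimated from all genes, whereas here they are fixed, hypothetical numbers, and `shrink_variance` is a name invented for this sketch.

```python
def shrink_variance(s2, df, prior_s2, prior_df):
    """Weighted average of a gene's own variance estimate and a prior variance."""
    return (prior_df * prior_s2 + df * s2) / (prior_df + df)

prior_s2, prior_df = 0.25, 4.0  # hypothetical; limma estimates these from all genes
estimates = [0.01, 0.25, 4.0]   # hypothetical per-gene variance estimates

# Few residual degrees of freedom: strong shrinkage towards the prior variance.
small_sample = [shrink_variance(s2, 4, prior_s2, prior_df) for s2 in estimates]
# Many residual degrees of freedom: the gene's own estimate dominates.
large_sample = [shrink_variance(s2, 396, prior_s2, prior_df) for s2 in estimates]
```

With 4 degrees of freedom the small estimate 0.01 moves up to 0.13 and the large estimate 4.0 moves down to 2.125; with 396 degrees of freedom both stay close to their original values.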
Effect of variance shrinkage
- Genes with large fold change and large variance
  - More power
  - More likely to be significant
- Genes with small fold change and small variance
  - Less power
  - Less likely to be significant
Limma and sample size
Shrinkage of limma only effective for small sample size (< 10 samples/group)
Added information from other genes becomes negligible when the sample size gets large
Large samples: Doing limma is the same as doing regression per gene
Differential expression in RNAseq
RNAseq data: counts
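| Gene id | Y1 | Y2 | Y3 | Y4 | Y5 | Y6 | Y7 | Y8 | Y9 | Y10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ENSG00000110514 | 69 | 178 | 101 | 58 | 101 | 31 | 165 | 108 | 70 | 1 |
| ENSG00000086015 | 115 | 52 | 86 | 88 | 146 | 84 | 59 | 85 | 86 | 0 |
| ENSG00000115808 | 285 | 190 | 467 | 295 | 345 | 532 | 369 | 473 | 423 | 5 |
| ENSG00000169740 | 502 | 184 | 363 | 195 | 403 | 262 | 225 | 332 | 136 | 3 |
| ENSG00000215869 | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |
| ENSG00000261609 | 20 | 31 | 76 | 20 | 25 | 158 | 23 | 18 | 23 | 1 |
| ENSG00000169744 | 488 | 529 | 470 | 505 | 1137 | 373 | 1392 | 3517 | 192 | 1 |
| ENSG00000215864 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |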
Modelling count data
- Distinguish three types of variation
  - Biological variation
  - Technical variation
  - Count variation
- Count variation is important for low-expressed genes
- Generally biological variation most important
Overdispersion
Modelling count data: two stages
1. Model how gene expression varies from sample to sample
2. Model how the observed count varies by repeated sequencing of the same sample

Stage 2 is specific for RNAseq
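
The two stages can be mimicked in a small simulation. The distributional choices are assumptions made for this sketch, not something the slides specify: true expression is drawn from a gamma distribution (stage 1) and the observed count from a Poisson distribution around it (stage 2). The mean, the shape parameter and the helper `poisson` are hypothetical.

```python
import math
import random
from statistics import mean, variance

rng = random.Random(1)

def poisson(lam):
    """Draw one Poisson(lam) count (Knuth's multiplication method)."""
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

mean_expression = 20.0  # hypothetical average expression level of one gene
shape = 2.0             # hypothetical gamma shape; smaller means more biological spread

# Count variation only: every sample has the same true expression.
poisson_only = [poisson(mean_expression) for _ in range(2000)]

# Stage 1: true expression varies between samples (gamma distributed).
# Stage 2: the observed count varies around that sample's true expression.
two_stage = [poisson(rng.gammavariate(shape, mean_expression / shape)) for _ in range(2000)]

spread_ratio = variance(two_stage) / variance(poisson_only)
```

Both sets of counts have roughly the same mean, but the two-stage counts are far more variable than count variation alone would suggest. That excess is the overdispersion in this slide's title.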
Two approaches
- Approach 1: Model the count variation and the between-sample variation
  - edgeR
  - DESeq
- Approach 2: Normalize the count data and model only the biological variation
  - voom + limma
- Approach 3: Model count variation only
  - Popular but very wrong!
Multiple testing
20,000 p-values
- Fitting 20,000 linear models
  - Some variance shrinkage
- Result:
  - 20,000 fold changes
  - 20,000 p-values
- Which ones are truly differentially expressed?
Multiple testing
Doing 20,000 tests: risk a false positive 20,000 times
If all 20,000 null hypotheses are true and each is tested at level 0.05, expect about 1,000 significant results by pure chance
How to make sure you can really trust the results?
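
The figure of about 1,000 chance findings can be checked with a short simulation. It assumes that every null hypothesis is true, so that each p-value is uniformly distributed between 0 and 1; the seed and variable names are arbitrary.

```python
import random

rng = random.Random(0)
n_tests, alpha = 20_000, 0.05

# Under a true null hypothesis a p-value is uniformly distributed on [0, 1].
null_p_values = [rng.random() for _ in range(n_tests)]
false_positives = sum(p < alpha for p in null_p_values)

expected = n_tests * alpha  # about 1,000 significant results by chance alone
```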
Bonferroni
- Classical way of doing multiple testing
  - Call K the number of tests performed
- Bonferroni: significant = p-value < 0.05/K
- “Adjusted p-value”
  - Multiply all p-values by K, compare with 0.05
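
A minimal sketch of the Bonferroni adjustment. The p-values are illustrative, the function name `bonferroni_adjust` is hypothetical, and capping the adjusted values at 1 is a common convention added here rather than something stated above.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests K, capped at 1."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]

p_values = [0.0031, 0.0034, 0.02, 0.10, 0.65]  # illustrative p-values, K = 5
adjusted = bonferroni_adjust(p_values)
significant = [p < 0.05 for p in adjusted]
```

Comparing the adjusted p-values with 0.05 gives the same decisions as comparing the raw p-values with 0.05/K; here only the two smallest p-values remain significant.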
Advantages of Bonferroni
- Familywise error control = probability of making any type I error < 0.05
- With 95% chance, list of differentially expressed genes has no errors
- Very strict
- Easy to do
Disadvantages of Bonferroni
- Very strict
  - “No” false positives
  - Many false negatives
- It is not a big problem to have a few false positives
  - Do validation experiments later
False discovery rate (Benjamini and Hochberg)
FDR = expected proportion of false discoveries among all discoveries
Control of FDR at 0.05 means that, in the long run, experiments average at most 5% type I errors among the reported genes
Percentage: longer lists of genes are allowed to have more errors
Benjamini and Hochberg by hand
1. Order the p-values small to large
   - Example: 0.0031, 0.0034, 0.02, 0.10, 0.65
2. Multiply the k-th p-value by m/k, where m is the number of p-values, so
   0.0031 * 5/1, 0.0034 * 5/2, 0.02 * 5/3, 0.10 * 5/4, 0.65 * 5/5
   which becomes
   0.0155, 0.0085, 0.033, 0.125, 0.65
3. If the p-values are no longer in increasing order, replace each p-value by the smallest p-value that is later in the list. In the example, we replace 0.0155 by 0.0085. The final Benjamini-Hochberg adjusted p-values become
   0.0085, 0.0085, 0.033, 0.125, 0.65
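
The three steps translate directly into code. The sketch below walks from the largest p-value down and keeps a running minimum, which has the same effect as step 3; the function name `benjamini_hochberg` is chosen for this illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # step 1: order small to large
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that each adjusted value is the
    # minimum of its own scaled value and all later ones (step 3).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)  # step 2: multiply by m/k
        adjusted[i] = running_min
    return adjusted

adjusted = benjamini_hochberg([0.0031, 0.0034, 0.02, 0.10, 0.65])
```

Up to rounding this reproduces the hand calculation: 0.0085, 0.0085, 0.033, 0.125, 0.65.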
FDR warnings
- FDR is susceptible to cheating
- How to cheat with FDR?
  - Add many tests of known false null hypotheses…
- Result: reject more of the other null hypotheses
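
The effect is easy to reproduce with made-up numbers. The five "honest" p-values below yield no rejections at FDR 0.05 on their own; after padding the list with 95 tests of hypotheses known to be false (represented here by very small p-values), four of them are rejected. The helper repeats a Benjamini-Hochberg adjustment so that the sketch is self-contained; all values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

honest = [0.02, 0.03, 0.04, 0.045, 0.30]  # hypothetical p-values of real interest
padding = [1e-10] * 95                    # tests of null hypotheses known to be false

rejected_before = sum(p < 0.05 for p in benjamini_hochberg(honest))
# The honest tests come first, so their adjusted p-values are the first five.
rejected_after = sum(p < 0.05 for p in benjamini_hochberg(honest + padding)[:len(honest)])
```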
Example limma results
Conclusion
Testing for differentially expressed genes
Repeated application of a linear model
Include all factors in the model that may influence gene expression
Limma: additional step “borrowing strength”
Don’t forget to correct for multiple testing!