Lecture 8 2014
Quality control of high throughput biological data
and Statistical testing for large biological data
Anja Bråthen Kristoffersen
Introduction
• There are many sources of variability and bias in high-
throughput biological experiments.
• These make it difficult to distinguish between biological differences
and experimental noise.
• Raw data can be very misleading.
• We will look at design, transformation and normalization
methods to reduce technical noise.
Statistical bioinformatics 3
Randomization of experiments
• Ensure that you will not have any systematic biases:
– Distribute the biological groups in a balanced way
– Divide into batches of the same size, limited by the
capacity on each step
• Randomize and balance according to the biology
that you are interested in
– Make random numbers by using the function sample() in R
– E.g. draw 10 numbers between 1 and 10 without
replacement:
> sample(10, 10, replace = FALSE)
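A minimal sketch of a balanced, randomized batch assignment built on sample(). The group labels, sample counts, and number of batches are hypothetical and only serve the illustration.

```r
set.seed(2014)  # fixed seed only so the example is reproducible

# Hypothetical design: 12 samples, 6 per biological group, 3 batches of 4
samples <- data.frame(
  id    = 1:12,
  group = rep(c("control", "treated"), each = 6)
)

# Within each group, shuffle a balanced vector of batch labels (2 per batch)
samples$batch <- NA
for (g in unique(samples$group)) {
  idx <- which(samples$group == g)
  samples$batch[idx] <- sample(rep(1:3, length.out = length(idx)))
}

# Randomize the processing order across the whole experiment
samples$run_order <- sample(nrow(samples))

# Cross-tabulate group against batch to check the balance
table(samples$group, samples$batch)
```

Shuffling within each group, rather than across all samples at once, is what keeps the biological groups evenly spread over the batches.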
Experimental plan: an example
11 January 2014 Statistical bioinformatics 5
Samples color coded according to biology
Samples color coded according to labeling date
Precautions
• Experimental methods should be standardized
across the same experiment
– ideally across all experiments
• Multiple biological replicates make it possible to
account for individual variability.
• If possible, multiple technical replicates
– Partition the same sample into multiple runs or even
multiple machines
• In the end, the data should be precise, accurate,
and directly comparable to other data.
Statistical goals in quality control analysis
• Examine the distributional properties of the data and
assess their quality
– Goal 1: to examine whether the data are appropriate
for any subsequent analysis → outlier detection
– Goal 2: to investigate the variability and relationships
among different samples and replicates
The goal
• Most statistical analyses rely on well-behaved
distributions.
– Skewed distribution → data transformation
– Heterogeneous variance → variance-stabilizing
transformation
• e.g. power transformations
– Outliers and noise → robust statistics, e.g.
• median is robust while mean is not
– Data from different experiments should be comparable
• Data normalization
Example: effect of outliers
[Figure: left panel, a random set of 10 points; right panel, the same points with one more point added]
Even one outlier can change your whole idea about the data if you are
not careful!
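The effect can be reproduced with a few lines of R. This is only a sketch on simulated data; the outlier at (10, 10) is an arbitrary choice, not the point used in the slide figure.

```r
set.seed(1)
x <- rnorm(10)
y <- rnorm(10)                 # generated independently of x
cor(x, y)                      # correlation of the 10 random points

# Add a single extreme point
x_out <- c(x, 10)
y_out <- c(y, 10)
cor(x_out, y_out)              # now dominated by the one outlier

# The mean moves a lot, the median hardly at all
c(mean_before = mean(x),     mean_after = mean(x_out))
c(median_before = median(x), median_after = median(x_out))
```

The last two lines also illustrate why the median counts as a robust statistic while the mean does not.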
Motivation for Normalization
• Assume you do an experiment and find a negative correlation. Then
you combine this with the results from your colleague, who used a
different reference:
[Figure panels: "Your results" and "Combined results"]
This is called Simpson's Paradox, and it can ruin your whole day!
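A small simulation can make the paradox concrete. The two datasets below are invented: each has a negative relationship between x and y, but the second one is measured against a different baseline (a hypothetical offset of 6 on both axes).

```r
set.seed(3)
# Your results: negative relationship around (0, 0)
x_you <- rnorm(50, mean = 0)
y_you <- 0 - 0.5 * x_you + rnorm(50, sd = 0.5)

# Colleague's results: same relationship, shifted reference
x_col <- rnorm(50, mean = 6)
y_col <- 6 - 0.5 * (x_col - 6) + rnorm(50, sd = 0.5)

cor(x_you, y_you)                          # negative
cor(x_col, y_col)                          # negative
cor(c(x_you, x_col), c(y_you, y_col))      # positive when naively combined

# Centering each dataset before combining removes the artificial trend
xc <- c(x_you - mean(x_you), x_col - mean(x_col))
yc <- c(y_you - mean(y_you), y_col - mean(y_col))
cor(xc, yc)                                # negative again
```

The sign flip comes entirely from the difference between the two group means, which is why datasets of different origin need to be normalized before they are pooled.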
Descriptive Statistics - Box-plot
[Figure: box plot of x with a y-axis from −2 to 2, annotated with the median, the 25% and 75% quantiles, the IQR, and whiskers extending up to 1.5 × IQR beyond the box. Everything above or below the whiskers is considered an outlier.]
IQR = 75% quantile − 25% quantile = interquartile range
x <- rnorm(100, mean=0, sd=1)
boxplot(x)
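The 1.5 × IQR rule can also be applied by hand. The sketch below uses simulated data with one artificial extreme value (6) appended so that there is something to flag.

```r
set.seed(7)
x <- c(rnorm(100, mean = 0, sd = 1), 6)   # one artificial extreme value

q   <- quantile(x, probs = c(0.25, 0.75)) # 25% and 75% quantiles
iqr <- q[2] - q[1]                        # same value as IQR(x)

lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Points outside the fences are the ones a box plot draws individually
outliers <- x[x < lower | x > upper]
outliers
```

boxplot.stats(x)$out returns the points that boxplot() itself flags; it works from hinges, which can differ slightly from the quantile() values used here.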
Transformations (log)
$y = e^{x}, \quad x = \ln(y)$
$y = 10^{x}, \quad x = \log_{10}(y)$
$y = 2^{x}, \quad x = \log_{2}(y)$
• Cannot handle zero or negative values
• Minimizes the impact of extreme values
• The log2 transformation helps you easily identify
doublings or halvings in ratios.
Biexponential transformation
• Arcsinh
$y = \frac{e^{x} - e^{-x}}{2}, \quad x = \ln\left(y + \sqrt{y^{2} + 1}\right)$
• Logicle
$y = a e^{b(x - w)} - c e^{-d(x - w)} + f$
• Logicle transform is similar to arcsinh but with
more parameters in the transformation.
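As a quick check of the arcsinh formulas, the sketch below compares R's built-in asinh() with the explicit logarithm expression. The input values are arbitrary; the point is that zero and negative values are handled, unlike with a plain log.

```r
y <- c(-100, -10, -1, 0, 1, 10, 100)

x_asinh  <- asinh(y)                  # built-in inverse hyperbolic sine
x_manual <- log(y + sqrt(y^2 + 1))    # x = ln(y + sqrt(y^2 + 1))

all.equal(x_asinh, x_manual)          # the two agree up to rounding
all.equal(sinh(x_asinh), y)           # y = (e^x - e^-x) / 2 maps back

# For large |y| the transform behaves like a signed logarithm
cbind(y, asinh = x_asinh, signed_log = sign(y) * log(2 * abs(y)))
```

The signed_log column is only there for comparison and is not defined at y = 0, where it evaluates to NaN.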
Example: histogram and boxplot
log - transformation
• Data is highly skewed (positively)
– Lots of small values with a few very large values.
• Need to transform this into a well-behaved
distribution.
– Ideally something like a Gaussian.
• Log transformation is generally used for positively
skewed data.
– Use log2(X)
– More intuitive, each whole number is a twofold change
(+1 → * 2)
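A short sketch of the effect on simulated data. The intensities are drawn from a log-normal distribution purely for illustration; they are not the data shown in the slide figures.

```r
set.seed(11)
# Simulated positively skewed intensities: many small values, a few huge ones
raw <- 2^rnorm(1000, mean = 8, sd = 2)

summary(raw)            # mean well above the median: strong positive skew

logged <- log2(raw)
summary(logged)         # mean and median now close to each other

# A doubling on the raw scale is a step of +1 on the log2 scale
log2(200) - log2(100)
log2(50)  - log2(100)   # a halving is a step of -1
```

hist(raw) and hist(logged) show the same contrast graphically.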
Example: histogram and boxplot, after log transformation
QQ-plot
• The QQ-plot shows the theoretical quantiles
versus the empirical quantiles. If the distribution
assumed (theoretical one) is indeed the correct
one, we should observe a straight line with
gradient equal to 1.
QQ-plot
DLBCLpatientDataNEW.txt
• http://llmpp.nih.gov/DLBCL/
• The data are already normalized
summary(dat[,7:12])
boxplot(dat[,7:12])
x <- c(dat[,7], dat[,8], dat[,9], dat[,10], dat[,11], dat[,12])
hist(x)
qqnorm(x)
qqline(x)
Data Normalization
• Normalization allows us to handle several datasets of
different origin and use them together.
– Remember Simpson's Paradox!
• There are several standard methods:
– Shifting: add a constant to all data points, shifting the mean.
• Called centering if the constant added is −µ
– Scaling: multiply data points by a scaling factor based on some
reference mean, $x_{\mathrm{ref}}$:
$x'_{ij} = x_{ij} \, \frac{x_{\mathrm{ref}}}{x_j}$
– Quantile normalization: match the quantiles of two distributions
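Shifting and scaling can be sketched on a small simulated matrix. Two assumptions are made here that the slide does not state: $x_j$ in the scaling formula is taken to be the mean of sample (column) j, and the reference mean is taken to be the overall mean of the matrix.

```r
set.seed(5)
# Hypothetical data: rows are features i, columns are samples j
x <- cbind(s1 = rnorm(100, mean = 8,  sd = 1),
           s2 = rnorm(100, mean = 10, sd = 1),
           s3 = rnorm(100, mean = 12, sd = 1))

# Shifting (centering): add -mean to every value in a column
x_centered <- sweep(x, 2, colMeans(x), FUN = "-")

# Scaling: multiply column j by x_ref / (mean of column j)
x_ref    <- mean(x)                                  # assumed reference mean
x_scaled <- sweep(x, 2, x_ref / colMeans(x), FUN = "*")

colMeans(x_centered)   # all approximately 0
colMeans(x_scaled)     # all equal to x_ref
```

Centering changes only the location of each sample, while scaling also stretches or compresses its spread by the same factor.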
Quantile normalization
If you have a reference distribution:
• Sort your data.
• For any value in your data, find its rank among all other
data points (counting ranks from the largest value, so that the
formula below holds), and calculate the probability that X < x:
$P(X < x) = 1 - \frac{\operatorname{rank}(x)}{n}$
• Look up the value for that probability in the reference
cumulative distribution function (CDF).
• Replace your value with the reference value at the
same quantile.
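The steps above can be sketched as follows. Both the data and the reference distribution are simulated, and the reference is represented by an empirical sample, so the lookup uses quantile() rather than a closed-form CDF. Ranks are counted from the largest value (rank 1), which is an assumption about the slide's convention.

```r
set.seed(1)
x   <- rexp(200, rate = 1)              # data to normalize (simulated, skewed)
ref <- rnorm(1000, mean = 0, sd = 1)    # reference sample (simulated)

n <- length(x)

# Rank of each value among the data, rank 1 = largest value
rank_desc <- rank(-x, ties.method = "first")

# P(X < x) = 1 - rank(x) / n
p <- 1 - rank_desc / n

# Replace each value by the reference value at the same quantile
x_norm <- as.numeric(quantile(ref, probs = p))

# The ordering of the data is kept, the distribution now follows the reference
summary(x)
summary(x_norm)
```

Only the ranks of the original values survive the transformation, which is what makes datasets with very different raw distributions comparable afterwards.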
Mean average plot (MA plot)
• An XY scatter plot often leads to seeing biased error patterns
• Mathematical bias arises when a regression-based normalization
is used
• MA transformation: A = (X1+X2)/2 and M = X1-X2
MA plot
• The MA plot in the example shows bias.
• Typically, you want a distribution centered on M = 0.
MA plot, baseline correction
• The distribution can be corrected by finding and removing
the baseline of the MA plot.
– Locally weighted scatterplot smoothing (LOESS)
– Problem: intensity values are nonlinearly transformed by the
normalization, so linear relationships such as fold change are not
completely conserved.
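A minimal sketch of an MA plot with LOESS baseline removal, on simulated data. The two channels, the shape of the intensity-dependent bias, and the span of 0.5 are all invented for the example.

```r
set.seed(42)
n <- 2000
# Simulated log2 intensities for two channels; x1 carries an artificial
# intensity-dependent bias
truth <- runif(n, min = 4, max = 14)
x1 <- truth + 0.02 * (truth - 9)^2 + rnorm(n, sd = 0.3)
x2 <- truth + rnorm(n, sd = 0.3)

# MA transformation
A <- (x1 + x2) / 2
M <- x1 - x2

# Estimate the baseline of M as a smooth function of A, then remove it
fit         <- loess(M ~ A, span = 0.5)
M_corrected <- M - predict(fit)

plot(A, M, pch = ".", main = "MA plot before correction")
lines(sort(A), predict(fit)[order(A)], col = "red")   # fitted baseline
abline(h = 0, lty = 2)

c(before = mean(M), after = mean(M_corrected))
```

After the correction the cloud is centered on M = 0 across the whole range of A, at the cost of the nonlinear adjustment mentioned above.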
Statistical testing and large datasets
Sensitivity, specificity, FPR, FNR and FDR
                   Test result
Disease            Negative (testedN)                   Positive (testedP)
Negative (N)       Correct                              False positive (FP), type I error
Positive (P)       False negative (FN), type II error   Correct

False positive rate = $\mathrm{FPR} = \frac{FP}{N} = 1 - \text{specificity}$
False negative rate = $\mathrm{FNR} = \frac{FN}{P} = 1 - \text{sensitivity}$
False discovery rate = $\mathrm{FDR} = \frac{FP}{\text{testedP}}$
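These rates can be computed directly from a vector of true states and a vector of test calls. The data below are simulated, with a hypothetical test that detects 80% of the positives and wrongly flags 5% of the negatives.

```r
set.seed(9)
# Hypothetical truth and test calls for 1000 cases (TRUE = positive)
truth  <- rep(c(TRUE, FALSE), times = c(100, 900))
called <- ifelse(truth, runif(1000) < 0.8, runif(1000) < 0.05)

TP <- sum(called & truth);    FP <- sum(called & !truth)
FN <- sum(!called & truth);   TN <- sum(!called & !truth)
N  <- sum(!truth)             # truly negative
P  <- sum(truth)              # truly positive
testedP <- sum(called)        # tested positive

FPR <- FP / N                 # 1 - specificity
FNR <- FN / P                 # 1 - sensitivity
FDR <- FP / testedP           # share of positive calls that are wrong

c(FPR = FPR, FNR = FNR, FDR = FDR)
```

Note how a low FPR can coexist with a much higher FDR when true negatives greatly outnumber true positives, which is the typical situation in large screens.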
Plot of FPR vs. 1 − FNR for a statistical test
• Requires knowing the true positives and true
negatives.
• Easily done in R using the package ROCR:
install.packages("ROCR")
library(ROCR)
# pvalue and truePN are vectors of equal length: truePN holds the true
# class (positive or negative) of each data point, and pvalue is the
# p-value calculated for that data point.
# ROCR treats larger scores as more positive, so score with 1 - pvalue
# (assuming that small p-values indicate positives).
pred <- prediction(1 - pvalue, truePN)
# "tpr" (= 1 - FNR) on the y-axis against "fpr" on the x-axis
perf <- performance(pred, "tpr", "fpr")
plot(perf)
Receiver Operating Characteristic (ROC) curve
Multiple hypothesis testing
• Each test is designed to have a fixed expected
proportion of incorrectly rejected null hypotheses;
most often this level is 5%.
• When many tests are done, the probability of
falsely rejecting at least one null hypothesis increases,
so we correct the p-values according
to how many tests are done.
Example: 10 000 genes
• Q: is gene g, g = 1, …, 10 000, differentially
expressed?
• Gives 10 000 null hypotheses: $H_0^{1}, H_0^{2}, \ldots, H_0^{10000}$
– $H_0^{1}$: gene 1 not differentially expressed
– …
• Assume: no genes differentially expressed
– $H_0^{g}$ true for all g
• Significance level α = 0.01
– The probability of incorrectly concluding that any one gene is
differentially expressed is 0.01, so we expect 0.01 × 10 000 = 100
wrong rejections of $H_0^{g}$
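The expected number of wrong rejections can be checked with a small simulation. It relies on the standard fact that p-values from a continuous test statistic are uniformly distributed on [0, 1] when the null hypothesis is true; no real gene expression data are involved.

```r
set.seed(10000)
n_genes <- 10000
alpha   <- 0.01

# All null hypotheses true: p-values are uniform on [0, 1]
p <- runif(n_genes)

# Number of genes wrongly called differentially expressed
n_false <- sum(p < alpha)
n_false

# Expected value for comparison
alpha * n_genes
```

Every one of these calls is a false positive, since no gene is differentially expressed in the simulation.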
Need to control the risk of false positives
(type I errors)
• Corrected p-value:
– The original p-values do not tell the full story.
– Instead of using the original p-values for decision
making, we should use corrected ones.
Different correction methods
• Bonferroni (1935)
– Just multiply all the p-values by the number of tests
– Too conservative
• need very small p-value to reject $H_0$
• gives very little power
• Methods that control the family-wise error rate
(FWER).
• Methods that control the false discovery rate
(FDR).
Family-Wise Error Rate (FWER)
• Control type I errors at a level α: Pr(FP ≥ 1) < α
– Control the probability of making any false positive call at
the desired significance level
– Conservative methods such as Bonferroni correction
• Divide the significance level α by the number of tests done (e.g. genes), which is equivalent to multiplying each p-value by the number of tests
– Other less conservative but similar methods are:
• Sidak
• Bonferroni-Holm
• Westfall & Young
• Use one of these if you are most afraid of getting
stuff on your significant list that should not have
been there
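A simulation sketch of what FWER control means, using p.adjust() with the "bonferroni" and "holm" methods. The number of tests, the number of simulated experiments, and α = 0.05 are arbitrary choices, and every null hypothesis is true by construction.

```r
set.seed(21)
n_tests <- 100     # tests per experiment
alpha   <- 0.05
n_sim   <- 2000    # simulated experiments

# Share of experiments with at least one false positive call
any_fp <- function(method) {
  mean(replicate(n_sim, {
    p <- runif(n_tests)                        # all nulls true
    any(p.adjust(p, method = method) < alpha)
  }))
}

fwer <- c(none       = any_fp("none"),
          bonferroni = any_fp("bonferroni"),
          holm       = any_fp("holm"))
fwer
```

Without correction almost every experiment contains at least one false positive, while both corrections keep that share near α.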
False Discovery Rate (FDR)
• Calculate the expected proportion of type I errors
among the rejected hypotheses:
– FDR = E(#false positive predictions / #total positive predictions)
• Control the proportion of false positive calls among all
positive calls at the desired significance level
• Technique that applies to a set of p-values
– Benjamini & Hochberg
– Different newer variants of Benjamini & Hochberg
• Use one of these if you are most afraid of
missing out on interesting stuff
help(p.adjust)
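A short usage sketch of p.adjust() comparing the Bonferroni and Benjamini & Hochberg ("BH") corrections. The p-values are simulated as a mixture of 900 true nulls and 100 real effects; the beta distribution used for the effects is just a convenient way to produce small p-values.

```r
set.seed(33)
# 900 true nulls (uniform p-values) and 100 real effects (small p-values)
p      <- c(runif(900), rbeta(100, shape1 = 0.5, shape2 = 50))
is_alt <- rep(c(FALSE, TRUE), times = c(900, 100))

p_bonf <- p.adjust(p, method = "bonferroni")   # controls the FWER
p_bh   <- p.adjust(p, method = "BH")           # controls the FDR

# Number of positive calls at the 0.05 level
c(raw        = sum(p < 0.05),
  bonferroni = sum(p_bonf < 0.05),
  BH         = sum(p_bh < 0.05))

# Observed proportion of false positives among the BH calls
mean(!is_alt[p_bh < 0.05])
```

BH never makes fewer calls than Bonferroni on the same p-values, which is the sense in which it is the choice for those most afraid of missing interesting results.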
False discovery rate (FDR)
2014.03.05 42