
Lecture 8 2014

Quality control of high throughput biological data

and Statistical testing for large biological data

Anja Bråthen Kristoffersen

Introduction

• There are many sources of variability and bias in high-throughput biological experiments.
• These make it difficult to distinguish between biological differences and experimental noise.

• Raw data can be very misleading.

• We will look at design, transformation and normalization

methods to reduce technical noise.

Statistical bioinformatics 3

Randomization of experiments

• Ensure that you will not have any systematic biases:

– Distribute the biological groups in a balanced way

– Divide into batches of the same size, limited by the

capacity on each step

• Randomize and balance according to the biology

that you are interested in

– Make random numbers by using the function sample() in R
– E.g. draw 10 numbers between 1 and 10 without replacement:
sample(10, 10, replace = FALSE)


Experimental plan: an example

11 January 2014

Samples color coded according to biology


Samples color coded according to labeling date
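A plan of this kind can be drawn up with sample(). The sketch below is only an illustration: the 12 samples, the two groups ("control", "treated") and the three batches of four are hypothetical and not taken from the example in the figures.

```r
set.seed(1)  # fixed seed only so that the plan is reproducible

# Hypothetical design: 12 samples, 2 biological groups, 3 batches of 4
samples <- data.frame(
  id    = paste0("S", 1:12),
  group = rep(c("control", "treated"), each = 6)
)

# Shuffle the samples within each group and deal them out over the batches,
# so that every batch gets the same number of samples from each group
samples$batch <- NA_integer_
for (g in unique(samples$group)) {
  idx <- which(samples$group == g)
  samples$batch[sample(idx)] <- rep(1:3, length.out = length(idx))
}

# Randomize the processing order, then sort by batch
# (order() keeps the shuffled order within each batch)
samples <- samples[sample(nrow(samples)), ]
samples <- samples[order(samples$batch), ]

# Check the balance: each group should appear twice in each batch
balance <- table(samples$group, samples$batch)
print(balance)
```

Shuffling within each biological group, rather than over all samples at once, is what guarantees the balance; a plain sample(12) could by chance put most of one group into a single batch.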


Precautions

• Experimental methods should be standardized

across the same experiment

– ideally across all experiments

• Multiple biological replicates make it possible to

account for individual variability.

• If possible, multiple technical replicates

– Partition the same sample into multiple runs or even

multiple machines

• In the end, the data should be precise, accurate, and directly comparable to other data.

Statistical goals in quality control analysis

• Examine the distributional properties of the data and assess their quality
– Goal 1: examine whether the data are appropriate for subsequent analysis → outlier detection
– Goal 2: investigate the variability and relationships among different samples and replicates


The goal

• Most statistical analyses rely on well-behaved distributions.
– Skewed distribution → data transformation
– Heterogeneous variance → variance-stabilizing transformation
• e.g. power transformations
– Outliers and noise → robust statistics
• e.g. the median is robust while the mean is not
– Data from different experiments should be comparable
• Data normalization


Example: effect of outliers


[Figure: left panel "A random set of 10 points"; right panel "One more point added"]
Even one outlier can change your whole idea about the data if you are not careful!
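The effect is easy to reproduce. The sketch below uses ten fixed, made-up points (standing in for the random set in the figure) and one hypothetical outlier at (30, 30); none of these values come from the slides.

```r
# Ten made-up points with essentially no linear relationship
x <- 1:10
y <- c(6, 4, 7, 3, 5, 6, 4, 5, 3, 7)
r_without <- cor(x, y)

# Add a single extreme point
x_out <- c(x, 30)
y_out <- c(y, 30)
r_with <- cor(x_out, y_out)

print(c(without = r_without, with = r_with))

# The mean is pulled towards the outlier, the median is not
print(c(mean = mean(y), mean_outlier = mean(y_out)))
print(c(median = median(y), median_outlier = median(y_out)))
```

The correlation goes from about zero to strongly positive because of one point, while the median of y is unchanged. This is the sense in which the median is robust and the mean is not.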

Motivation for Normalization

• Assume you do an experiment and find a negative correlation. Then

you combine this with the results from your colleague, who used a

different reference:

[Figure: left panel "Your results"; right panel "Combined results"]


This is called Simpson's Paradox, and it can ruin your whole day!
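A small constructed example shows how this can happen. The numbers and the lab labels ("you", "colleague") are made up for illustration; the second lab measures the same negative trend, but against a different reference, which shifts all its values.

```r
lab <- rep(c("you", "colleague"), each = 5)
x <- as.numeric(c(1:5, 6:10))
y <- as.numeric(c(5:1, 15:11))

# Within each lab the correlation is negative
r_you       <- cor(x[lab == "you"], y[lab == "you"])
r_colleague <- cor(x[lab == "colleague"], y[lab == "colleague"])

# Naively combined, the correlation becomes positive
r_combined <- cor(x, y)

# Centering each lab on its own mean (a simple normalization)
# brings the two datasets onto a common reference
x_centered <- x - ave(x, lab)
y_centered <- y - ave(y, lab)
r_normalized <- cor(x_centered, y_centered)

print(c(you = r_you, colleague = r_colleague,
        combined = r_combined, normalized = r_normalized))
```

Centering per lab is only one possible choice here; which normalization is appropriate depends on how the references actually differ.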

Descriptive Statistics - Box-plot


[Figure: box plot annotated with the median, the 25% and 75% quantiles, the IQR, and whiskers reaching at most 1.5 × IQR beyond the box. Everything above or below the whiskers is considered an outlier.]

IQR = 75% quantile − 25% quantile = interquartile range

x <- rnorm(100, mean=0, sd=1)

boxplot(x)
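The 1.5 × IQR rule can also be applied by hand. The sketch below uses a small made-up vector; note that boxplot() itself computes the box from hinges (see ?boxplot.stats), which can differ slightly from quantile() for small samples.

```r
# Made-up data with one obviously extreme value
x <- c(1:9, 30)

# 25% and 75% quantiles and the interquartile range
q   <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Fences 1.5 x IQR below and above the box
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers <- x[x < lower_fence | x > upper_fence]
print(outliers)
```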

Transformations (log)

$y = e^{x}, \quad x = \ln(y)$

$y = 10^{x}, \quad x = \log_{10}(y)$

$y = 2^{x}, \quad x = \log_{2}(y)$

• Cannot handle zero or negative values
• Reduces the impact of extreme values
• The log2 transformation helps you easily identify doublings or halvings in ratios.
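A short illustration with made-up ratios: on the log2 scale each doubling adds 1 and each halving subtracts 1, so up- and down-regulation become symmetric around 0.

```r
# Made-up expression ratios (treated / control)
ratio <- c(0.25, 0.5, 1, 2, 4)

log_ratio <- log2(ratio)
print(log_ratio)

# A ratio and its inverse are equally far from 0 on the log2 scale
symmetric <- isTRUE(all.equal(log2(4), -log2(1 / 4)))
print(symmetric)
```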


Biexponential transformation

• Arcsinh

$y = \sinh(x) = \frac{e^{x} - e^{-x}}{2}, \quad x = \ln\left(y + \sqrt{y^{2} + 1}\right)$

• Logicle

$y = a e^{b(x - w)} - c e^{-d(x - w)} + f$

• The logicle transform is similar to arcsinh but has more parameters in the transformation.
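The arcsinh transform is available in base R as asinh(). The sketch below, on made-up values, checks it against the formula above and shows that, unlike the logarithm, it is defined for zero and negative values while behaving like a logarithm for large values.

```r
# Made-up values, including zero and negatives
y <- c(-100, -1, 0, 1, 100)

x <- asinh(y)
print(x)

# Same result as the explicit formula x = ln(y + sqrt(y^2 + 1))
x_formula <- log(y + sqrt(y^2 + 1))

# sinh() is the inverse and recovers the original values
y_back <- sinh(x)

# For large y, asinh(y) is close to ln(2y)
print(c(asinh = asinh(100), log_2y = log(2 * 100)))
```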


Example: histogram and boxplot


Log transformation

• Data is highly skewed (positively)

– Lots of small values with a few very large values.

• Need to transform this into a well-behaved

distribution.

– Ideally something like a Gaussian.

• Log transformation is generally used for positively

skewed data.

– Use log2(X)

– More intuitive, each whole number is a twofold change

(+1 → * 2)
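The effect can be tried out on simulated data. The sketch below draws log-normally distributed values as a stand-in for positively skewed intensities; the distribution and its parameters are assumptions for illustration only.

```r
set.seed(1)

# Simulated positively skewed data: many small values, a few very large ones
x <- rlnorm(1000, meanlog = 5, sdlog = 1)

# Before: the mean lies far above the median
print(c(mean = mean(x), median = median(x)))

# After log2: mean and median nearly coincide
lx <- log2(x)
print(c(mean = mean(lx), median = median(lx)))

# To look at the shapes:
# hist(x); boxplot(x)
# hist(lx); boxplot(lx)
```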


Example: histogram and boxplot, after log transformation


QQ-plot

• The QQ-plot shows the theoretical quantiles

versus the empirical quantiles. If the distribution

assumed (theoretical one) is indeed the correct

one, we should observe a straight line with

gradient equal to 1.


QQ-plot


DLBCLpatientDataNEW.txt

• http://llmpp.nih.gov/DLBCL/

• Is already normalized


summary(dat[,7:12])


boxplot(dat[,7:12])


x <- c(dat[,7], dat[,8], dat[,9], dat[,10], dat[,11], dat[,12])

hist(x)


qqnorm(x)

qqline(x)


Data Normalization

• Normalization allows us to handle several datasets of

different origin and use them together.

– Remember Simpson's Paradox!

• There are several standard methods:

– Shifting: add a constant to all data points, shifting the mean.
• Called centering if the constant added is −µ
– Scaling: multiply the data points by a scaling factor based on some reference mean, $x_{ref}$, where $\bar{x}_j$ is the mean of sample j:

$x'_{ij} = x_{ij} \, \frac{x_{ref}}{\bar{x}_j}$

– Quantile normalization: match the quantiles of two distributions
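Shifting and scaling can be sketched in a few lines. The matrix below is made up (rows are features, columns are samples), and the reference mean x_ref = 5 is an arbitrary choice for illustration.

```r
# Made-up data: 3 features (rows) measured in 2 samples (columns)
m <- cbind(sample1 = c(1, 2, 3), sample2 = c(10, 20, 30))

# Shifting (centering): subtract each sample's mean
centered <- sweep(m, 2, colMeans(m), "-")

# Scaling: multiply each sample by x_ref / (sample mean)
x_ref  <- 5
scaled <- sweep(m, 2, x_ref / colMeans(m), "*")

print(colMeans(centered))  # every sample now has mean 0
print(colMeans(scaled))    # every sample now has mean x_ref
```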


Quantile normalization

If you have a reference distribution:
• Sort your data.
• For any value x in your data, find its rank among all n data points, counting from the largest value (rank 1 = largest), and calculate the probability that X < x:

$P(X < x) = 1 - \frac{\mathrm{rank}(x)}{n}$

• Look up the value for that probability in the reference cumulative distribution function (CDF).
• Replace your value with the reference value at the same quantile.
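The steps above can be sketched as follows. Both vectors are made up, and the reference distribution is represented by a sample (ref) whose empirical quantiles are read off with quantile(); a real application might use a different reference or a different way of handling ties.

```r
# Made-up data and a made-up reference sample
x   <- c(500, 100, 300, 200, 400)
ref <- c(10, 20, 30, 40, 50)
n   <- length(x)

# Rank counted from the largest value (rank 1 = largest),
# so that P(X < x) = 1 - rank / n
rank_desc <- rank(-x)
p <- 1 - rank_desc / n

# Look up the reference value at the same quantile
x_norm <- quantile(ref, probs = p, names = FALSE)

print(data.frame(x = x, p = p, x_norm = x_norm))
```

The ordering of the data is preserved; only the values are replaced by those of the reference distribution.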


Mean average plot (MA plot)

• XY scatter plots often lead to seeing biased error patterns
• Mathematical bias arises when a regression-based normalization is used
• MA transformation: A = (X1 + X2)/2 and M = X1 − X2


MA plot

• The MA plot in the example shows bias.
• Typically, you want a distribution centered on M = 0.


MA plot, baseline correction

• The distribution can be corrected by finding and removing the baseline of the MA plot.
– Locally weighted scatterplot smoothing (LOESS)
– Problem: the intensity values are nonlinearly transformed by the normalization, so linear relationships such as fold changes are not completely conserved.
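A minimal sketch of the MA transformation and a LOESS baseline correction, on simulated log-intensities. The intensity-dependent bias and the noise level are assumptions made up for this illustration, and loess() is used with its default settings.

```r
set.seed(42)
n <- 500

# Simulated log2 intensities for two channels with an
# intensity-dependent bias (all values are made up)
A_true <- runif(n, min = 6, max = 14)
M_sim  <- 0.1 * (A_true - 6) + rnorm(n, sd = 0.3)
X1 <- A_true + M_sim / 2
X2 <- A_true - M_sim / 2

# MA transformation
A <- (X1 + X2) / 2
M <- X1 - X2

# Estimate the baseline of the MA plot with LOESS and remove it
fit      <- loess(M ~ A)
baseline <- predict(fit)
M_corrected <- M - baseline

print(c(mean_M_before = mean(M), mean_M_after = mean(M_corrected)))

# To look at the result:
# plot(A, M); lines(sort(A), baseline[order(A)], col = "red")
# plot(A, M_corrected); abline(h = 0)
```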


Statistical testing and large datasets


Sensitivity, specificity, FPR, FNR and FDR

Disease         Test result: Negative (testedN)       Test result: Positive (testedP)
Negative (N)    Correct                               False positive (FP), type I error
Positive (P)    False negative (FN), type II error    Correct

$\text{False positive rate} = FPR = \frac{FP}{N} = 1 - \text{specificity}$

$\text{False negative rate} = FNR = \frac{FN}{P} = 1 - \text{sensitivity}$

$\text{False discovery rate} = FDR = \frac{FP}{\text{testedP}}$
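As a worked example, the rates can be computed from the four cells of the table. The counts below are made up for illustration.

```r
# Made-up counts for the four cells of the table
TP <- 80    # disease positive, tested positive
FN <- 20    # disease positive, tested negative
FP <- 30    # disease negative, tested positive
TN <- 870   # disease negative, tested negative

P <- TP + FN        # all truly positive
N <- FP + TN        # all truly negative
testedP <- TP + FP  # all that tested positive

sensitivity <- TP / P
specificity <- TN / N

FPR <- FP / N        # = 1 - specificity
FNR <- FN / P        # = 1 - sensitivity
FDR <- FP / testedP

print(c(FPR = FPR, FNR = FNR, FDR = FDR))
```

Note that FPR and FDR have the same numerator but different denominators: here only about 3% of the true negatives are called positive, yet about 27% of the positive calls are false.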

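Plot of FPR vs. 1 − FNR for a statistical test

• You need to know which data points are truly positive and which are truly negative. The plot is then easily made in R using the package ROCR:

install.packages("ROCR")  # only needed once
library(ROCR)
# pvalue and truePN are vectors of equal length: truePN holds the true
# class (positive or negative) of each data point, and pvalue is the
# p-value calculated for that data point.
# ROCR treats larger scores as evidence for the positive class, so if
# small p-values indicate positives, pass 1 - pvalue as the score.
pred <- prediction(1 - pvalue, truePN)
# "tpr" (= 1 - FNR) on the y-axis against "fpr" on the x-axis
perf <- performance(pred, "tpr", "fpr")
plot(perf)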

Receiver Operating Characteristic (ROC) curve


Multiple hypothesis testing

• A test is designed to have a given expected proportion of incorrectly rejected null hypotheses; most often this level is 5%.
• When many tests are done, the probability of falsely rejecting at least one null hypothesis increases, hence we correct the p-values according to the number of tests done.

Example 10000 genes

• Q: is gene g, g = 1, …, 10 000, differentially expressed?
• This gives 10 000 null hypotheses: $H_0^1, H_0^2, \ldots, H_0^{10000}$
– $H_0^1$: gene 1 is not differentially expressed
– …
• Assume that no genes are differentially expressed
– $H_0^g$ is true for all g
• Significance level α = 0.01
– The probability of incorrectly concluding that any one gene is differentially expressed is 0.01, so we expect 0.01 · 10 000 = 100 wrong rejections of $H_0^g$
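This expectation can be checked with a small simulation. The sketch relies on the fact that p-values are uniformly distributed on [0, 1] when the null hypothesis is true, so the 10 000 tests are replaced by 10 000 uniform random numbers; it also assumes the tests are independent.

```r
set.seed(1)

n_genes <- 10000
alpha   <- 0.01

# Under H0 for all genes, the p-values are uniform on [0, 1]
p <- runif(n_genes)

# Number of (false) rejections at level alpha; expected value is 100
n_false <- sum(p < alpha)
print(n_false)

# Probability of at least one false rejection among independent tests
p_any <- 1 - (1 - alpha)^n_genes
print(p_any)
```

With 10 000 tests at α = 0.01, at least one false positive is practically certain, which is why the unadjusted p-values cannot be used directly.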

Need to control the risk of false positives (type I errors)

• Corrected p-value:
– The original p-values do not tell the full story.
– Instead of using the original p-values for decision making, we should use corrected ones.

Different correction methods

• Bonferroni (1935)
– Just multiply all the p-values by the number of tests
– Too conservative
• needs a very small p-value to reject $H_0$
• gives very little power
• Methods that control the family-wise error rate (FWER).
• Methods that control the false discovery rate (FDR).

Family-Wise Error Rate (FWER)

• Control type I errors at a level α: Pr(FP ≥ 1) < α
– Control the probability of making any false positive call at the desired significance level
– Conservative methods such as the Bonferroni correction
• Divide the significance level α by the number of tests done (e.g. genes); this is equivalent to multiplying each p-value by the number of tests
– Other, less conservative but similar methods are:
• Sidak
• Bonferroni-Holm
• Westfall & Young
• Use one of these if you are most afraid of getting stuff on your significant list that should not have been there

False Discovery Rate (FDR)

• Calculate the expected proportion of type I errors among the rejected hypotheses:
– FDR = E(#false positive predictions / #total positive predictions)
• Control the proportion of false positive calls among all positive calls at the desired level
• Technique that applies to a set of p-values
– Benjamini & Hochberg
– Different newer variants of Benjamini & Hochberg
• Use one of these if you are most afraid of missing out on interesting stuff

help(p.adjust)
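A short example of p.adjust() on five made-up p-values, comparing the Bonferroni, Holm and Benjamini & Hochberg corrections:

```r
# Five made-up raw p-values
p <- c(0.001, 0.01, 0.02, 0.04, 0.5)

p_bonf <- p.adjust(p, method = "bonferroni")  # controls the FWER
p_holm <- p.adjust(p, method = "holm")        # controls the FWER, less conservative
p_bh   <- p.adjust(p, method = "BH")          # controls the FDR

print(data.frame(raw = p, bonferroni = p_bonf, holm = p_holm, BH = p_bh))

# Number of rejections at the 5% level for each method
n_rejected <- c(raw = sum(p < 0.05), bonferroni = sum(p_bonf < 0.05),
                holm = sum(p_holm < 0.05), BH = sum(p_bh < 0.05))
print(n_rejected)
```

The adjusted p-values are compared with the usual significance level. For the same data, the BH-adjusted values are never larger than the Holm-adjusted ones, which in turn are never larger than the Bonferroni-adjusted ones.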


False discovery rate (FDR)

2014.03.05
