Essential Statistics in Biology: Getting the Numbers Right

Raphael Gottardo

Clinical Research Institute of Montreal (IRCM)

http://www.rglab.org/

Day 1 2

Outline

•Exploratory Data Analysis

•1-2 sample t-tests, multiple testing

•Clustering

•SVD/PCA

•Frequentists vs. Bayesians

PCA and SVD (Multivariate analysis)

Day 1 - Section 4 4

Outline

•What is SVD? Mathematical definition

•Relation to Principal Component Analysis (PCA)

•Applications of PCA and SVD

•Illustration with gene expression data


SVD

Let X be a matrix of size m×n (m≥n) and rank r≤n. Then we can decompose X as

$$X = U S V^T$$

where U is m×n, S is n×n and V is n×n.

- U is the matrix of left singular vectors
- V is the matrix of right singular vectors
- S is a diagonal matrix whose diagonal entries are the singular values
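The decomposition can be checked numerically. The following is a minimal sketch on a small simulated matrix (the data and the variable names are illustrative assumptions, not part of the slides), using R's `svd()`, which returns `u`, `d` and `v`.

```r
# Minimal check of X = U S V^T on simulated data (hypothetical example).
set.seed(1)
X <- matrix(rnorm(20), nrow = 5, ncol = 4)   # m = 5, n = 4
s <- svd(X)                                  # s$u is 5x4, s$d has length 4, s$v is 4x4

# Rebuild X from its three factors
X.rebuilt <- s$u %*% diag(s$d) %*% t(s$v)
rebuild.error <- max(abs(X - X.rebuilt))     # numerically zero

# The columns of U and of V are orthonormal
u.orth.error <- max(abs(crossprod(s$u) - diag(4)))
v.orth.error <- max(abs(crossprod(s$v) - diag(4)))
```

`svd()` returns the singular values in decreasing order, so the first singular vectors are associated with the largest amplitudes.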

[Figure: the decomposition X = U S V^T with matrix dimensions (X: m×n, U: m×n, S: n×n, V^T: n×n), annotated with "Direction" and "Amplitude".]

Relation to PCA

Assume that the rows of X are centered; then $X^T X$ is (up to a constant) the empirical covariance matrix, and SVD is equivalent to PCA.

The columns of V (the rows of $V^T$) are the right singular vectors, or principal components.

[Figure labels: "New variables", "Variance".]

Gene expression: eigengenes or eigenassays
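As a sketch of this equivalence, the code below compares the SVD of a centered matrix with the eigen-decomposition of its covariance matrix and with `prcomp()`. It assumes that "centered" means each column (variable) has its mean subtracted, so that the observations are centered around the origin; the simulated data are hypothetical.

```r
# SVD versus PCA on simulated data (hypothetical example).
set.seed(1)
X <- matrix(rnorm(200), nrow = 50, ncol = 4)
Xc <- scale(X, center = TRUE, scale = FALSE)   # subtract the column means
n.obs <- nrow(Xc)

s <- svd(Xc)
e <- eigen(cov(X))          # cov(X) equals t(Xc) %*% Xc / (n.obs - 1)

# Variances of the principal components: d^2 / (n - 1)
var.diff <- max(abs(s$d^2 / (n.obs - 1) - e$values))

# Principal directions agree up to the sign of each column
vec.diff <- max(abs(abs(s$v) - abs(e$vectors)))

# prcomp() reports the same quantities as standard deviations
sdev.diff <- max(abs(prcomp(X)$sdev - s$d / sqrt(n.obs - 1)))
```

The sign of each singular vector is arbitrary, which is why the comparison uses absolute values.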

Applications of SVD and PCA

•Dimension reduction (simplify a dataset)

•Clustering

•Discriminant analysis

•Exploratory data analysis tool

•Find the most important signal in data

•2D projections
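To illustrate dimension reduction, here is a minimal sketch that keeps only the first k singular triplets of a simulated matrix. The data and the choice k = 2 are assumptions made for illustration. The reconstruction error in Frobenius norm equals the square root of the sum of the discarded squared singular values.

```r
# Rank-k approximation from the SVD (hypothetical data, k chosen arbitrarily).
set.seed(1)
X <- matrix(rnorm(60), nrow = 10, ncol = 6)
s <- svd(X)
k <- 2

# Keep the k largest singular values and their singular vectors
X.k <- s$u[, 1:k] %*% diag(s$d[1:k], k, k) %*% t(s$v[, 1:k])

# Error of the approximation, and its value predicted from the singular values
frob.error <- sqrt(sum((X - X.k)^2))
expected.error <- sqrt(sum(s$d[-(1:k)]^2))
```

When the first few singular values dominate, a small k already captures most of the signal, which is the idea behind 2D projections.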

Toy example

s=(13.47,1.45)

set.seed(100)
x1<-rnorm(100,0,1)
y1<-rnorm(100,1,1)
var0.5<-matrix(c(1,-.5,-.5,.1),2,2)
data1<-t(var0.5%*%t(cbind(x1,y1)))

set.seed(100)
x2<-rnorm(100,2,1)
y2<-rnorm(100,2,1)
var0.5<-matrix(c(1,.5,.5,1),2,2)
### data2 is missing from the extracted slide; rebuilt here by analogy with data1
data2<-t(var0.5%*%t(cbind(x2,y2)))

data<-rbind(data1,data2)

svd1<-svd(data1)
plot(data1,xlab="x",ylab="y",xlim=c(-6,6),ylim=c(-6,6))
abline(coef=c(0,svd1$v[2,1]/svd1$v[1,1]),col=2)
abline(coef=c(0,svd1$v[2,2]/svd1$v[1,2]),col=3)


Toy example

s=(47.79,13.25)

svd2<-svd(data2)
plot(data2,xlab="x",ylab="y",xlim=c(-6,6),ylim=c(-6,6))
abline(coef=c(0,svd2$v[2,1]/svd2$v[1,1]),col=2)
abline(coef=c(0,svd2$v[2,2]/svd2$v[1,2]),col=3)

svd<-svd(data)
plot(data,xlab="x",ylab="y",xlim=c(-6,6),ylim=c(-6,6))
abline(coef=c(0,svd$v[2,1]/svd$v[1,1]),col=2)
abline(coef=c(0,svd$v[2,2]/svd$v[1,2]),col=3)


Toy example

### Projection
data.proj<-svd$u%*%diag(svd$d)
svd.proj<-svd(data.proj)

plot(data.proj,xlab="x",ylab="y",xlim=c(-6,6),ylim=c(-6,6))
abline(coef=c(0,svd.proj$v[2,1]/svd.proj$v[1,1]),col=2)
### svd.proj$v[1,2]=0
abline(v=0,col=3)


Toy example

s=(47.17,11.88)

[Figure: projected data plotted in the new coordinates.]
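The projection U S used in the toy example is the same as multiplying the data by V, that is, expressing each observation in the coordinates given by the right singular vectors. A minimal sketch, assuming simulated two-dimensional data (not the slide's `data` object):

```r
# Projection onto the singular vectors: U S equals X V (hypothetical data).
set.seed(100)
X <- cbind(rnorm(100, 2, 1), rnorm(100, 2, 1))
s <- svd(X)

proj.us <- s$u %*% diag(s$d)   # as computed in the slides
proj.xv <- X %*% s$v           # same coordinates, obtained by rotating the data
proj.diff <- max(abs(proj.us - proj.xv))

# In the new coordinates the axes are uncorrelated directions:
# the cross-product matrix is diagonal, with d^2 on the diagonal
cross <- crossprod(proj.us)
off.diagonal <- abs(cross[1, 2])
diag.diff <- max(abs(diag(cross) - s$d^2))
```

Because the cross-product matrix of the projected data is already diagonal, its singular vectors line up with the coordinate axes, which is consistent with the comment `svd.proj$v[1,2]=0` in the slide code.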


Toy example

### New data
set.seed(100)
x1<-rnorm(100,-1,1)
y1<-rnorm(100,1,1)
var0.5<-matrix(c(1,-.5,-.5,1),2,2)
### data1 is missing from the extracted slide; rebuilt here by analogy with the first toy example
data1<-t(var0.5%*%t(cbind(x1,y1)))

set.seed(100)
x2<-rnorm(100,1,1)
y2<-rnorm(100,1,1)
var0.5<-matrix(c(1,.5,.5,1),2,2)
### data2 is missing from the extracted slide; rebuilt in the same way
data2<-t(var0.5%*%t(cbind(x2,y2)))

data<-rbind(data1,data2)

svd1<-svd(data1)
plot(data1,xlab="x",ylab="y",xlim=c(-6,6),ylim=c(-6,6))
abline(coef=c(0,svd1$v[2,1]/svd1$v[1,1]),col=2)

svd2<-svd(data2)
plot(data2,xlab="x",ylab="y",xlim=c(-6,6),ylim=c(-6,6))

svd<-svd(data)
plot(data,xlab="x",ylab="y",xlim=c(-6,6),ylim=c(-6,6))
abline(coef=c(0,svd$v[2,1]/svd$v[1,1]),col=2)
abline(coef=c(0,svd$v[2,2]/svd$v[1,2]),col=3)


Toy example

s=(26.48,24.98)


Application to microarrays

•Dimension reduction (simplify a dataset)

•Clustering (too many samples)

•Discriminant analysis (find a group of genes)

•Exploratory data analysis tool

•Find the most important signal in data

•2D projections (clusters?)


Application to microarrays

Cho cell cycle data set

384 genes

We have standardized the data

cho.data<-as.matrix(read.table("logcho_237_4class.txt",skip=1)[,3:19])
cho.mean<-apply(cho.data,1,"mean")
cho.sd<-apply(cho.data,1,"sd")
cho.data.std<-(cho.data-cho.mean)/cho.sd

svd.cho<-svd(cho.data.std)
### Contribution of each PC
barplot(svd.cho$d/sum(svd.cho$d),col=heat.colors(17))
### First three singular vectors (PCA)
plot(svd.cho$v[,1],xlab="time",ylab="Expression profile",type="b")
plot(svd.cho$v[,2],xlab="time",ylab="Expression profile",type="b")
plot(svd.cho$v[,3],xlab="time",ylab="Expression profile",type="b")

### Projection
plot(svd.cho$u[,1]*svd.cho$d[1],svd.cho$u[,2]*svd.cho$d[2],xlab="PCA 1 ",ylab="PCA 2")
plot(svd.cho$u[,1]*svd.cho$d[1],svd.cho$u[,3]*svd.cho$d[3],xlab="PCA 1 ",ylab="PCA 3")
plot(svd.cho$u[,2]*svd.cho$d[2],svd.cho$u[,3]*svd.cho$d[3],xlab="PCA 2 ",ylab="PCA 3")

### Select a cluster
ind<-(svd.cho$u[,2]*svd.cho$d[2])^2+(svd.cho$u[,3]*svd.cho$d[3])^2>5 & svd.cho$u[,2]*svd.cho$d[2]>0 & svd.cho$u[,3]*svd.cho$d[3]<0

plot(svd.cho$u[,2]*svd.cho$d[2],svd.cho$u[,3]*svd.cho$d[3],xlab="PCA 2 ",ylab="PCA 3")
points(svd.cho$u[ind,2]*svd.cho$d[2],svd.cho$u[ind,3]*svd.cho$d[3],col=2)

matplot(t(cho.data.std[ind,]),xlab="time",ylab="Expression profiles",type="l")

http://faculty.washington.edu/kayee/cluster/logcho_237_4class.txt
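The row standardization and the relative contributions can be tried without the data file. The sketch below uses a simulated matrix with 17 columns as a stand-in for the expression data (the matrix `expr` and its dimensions are assumptions). It also computes an alternative summary based on squared singular values, which is a common way to express the share of total variation and is not what the slide code plots.

```r
# Row standardization and relative contributions (simulated stand-in data).
set.seed(1)
expr <- matrix(rnorm(30 * 17, mean = 5, sd = 2), nrow = 30, ncol = 17)

# Standardize each row (gene): subtract the row mean, divide by the row sd.
# The vector is recycled down the columns, so element i applies to row i.
expr.std <- (expr - rowMeans(expr)) / apply(expr, 1, sd)
row.means <- rowMeans(expr.std)
row.sds <- apply(expr.std, 1, sd)

s <- svd(expr.std)
rel.d <- s$d / sum(s$d)         # relative contribution, as in the slide code
rel.var <- s$d^2 / sum(s$d^2)   # alternative: share of the total sum of squares

# Every row sums to zero, so the rank is at most 16 and the last
# singular value is numerically zero.
last.d <- s$d[17]
```

Either summary ranks the components in the same order; they differ only in how strongly the leading components dominate.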


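Application to microarrays

[Figure: bar plot of the singular values (relative contribution), annotated "Main contribution" and "Why?".]

Application to microarrays

[Figure: PC1 expression profile.]

Application to microarrays

[Figures: projections onto PC1/PC2, PC1/PC3 and PC2/PC3; the cluster selected in the PC2/PC3 projection contains 24 genes.]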


Conclusion

•SVD is a powerful tool

•Can be very useful in gene expression data

•SVD of genes (eigen-genes)

•SVD of samples (eigen-assays)

•Mostly an EDA tool

Overview of Statistical Inference: Bayes vs. Frequentists

(If time permits)


Introduction

•Parametric statistical model

•Observations $y$ are drawn from a probability distribution $f(y \mid \theta)$, where $\theta$ is the parameter vector

Likelihood function → $L(\theta \mid y) = f(y \mid \theta)$ (inverted density)




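Introduction

Normal distribution

Probability distribution for one observation $y_i$ is

$$f(y_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i-\mu)^2}{2\sigma^2}\right)$$

If independence

$$f(y_1,\dots,y_n \mid \mu, \sigma^2) = \prod_{i=1}^{n} f(y_i \mid \mu, \sigma^2)$$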


Introduction

[Figure: 15 observations drawn from N(1,1), shown with the true probability distribution N(1,1).]
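The likelihood can be evaluated directly for such a sample. This is a minimal sketch, assuming 15 simulated observations from N(1,1) with the variance treated as known; the function name `log.lik` and the grid are illustrative choices.

```r
# Log-likelihood of the mean for 15 normal observations (variance assumed known).
set.seed(100)
y <- rnorm(15, mean = 1, sd = 1)

# Sum of log densities = log of the product of densities under independence
log.lik <- function(mu, y, sigma = 1) {
  sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

mu.grid <- seq(-1, 3, by = 0.001)
ll <- sapply(mu.grid, log.lik, y = y)

# The likelihood is maximized at (the grid point closest to) the sample mean
mu.max <- mu.grid[which.max(ll)]
```

Plotting `exp(ll - max(ll))` against `mu.grid` gives the likelihood curve on a convenient scale.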


Inference

•The parameters are unknown

•“Learn” something about the parameter vector θ from the data

•Make inference about θ

‣ Estimate θ

‣ Confidence region

‣ Test a hypothesis (θ=0)


The frequentist approach

•The parameters are fixed but unknown

•Inference is based on the relative frequency of occurrence when repeating the experiment

•For example, one can look at the variance of an estimator to evaluate its efficiency


The Normal Example: Estimation

Normal distribution $N(\mu,\sigma^2)$

$\mu$ is the mean and $\sigma^2$ is the variance

Estimators: $\hat\mu$ and $\hat\sigma^2$ (sample mean and sample variance)

Numerical example, 15 obs. from N(1,1)

Use the theory of repeated samples to evaluate the estimators.


The Normal Example: Estimation

In our toy example, the data are normal, and we can derive the sampling distribution of the estimators.

For example, we know that the sample mean $\hat\mu$ is normal with mean $\mu$ and variance $\sigma^2/n$. The standard deviation of an estimator is called the standard error.

What if we can't derive the sampling distribution?

Use the bootstrap!
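The idea of repeated samples can be simulated directly. The sketch below repeats the experiment many times (15 observations from N(1,1), with the number of repetitions chosen arbitrarily) and compares the empirical standard deviation of the sample mean with the theoretical standard error σ/√n.

```r
# Sampling distribution of the sample mean by simulation (illustrative settings).
set.seed(100)
n <- 15
n.rep <- 5000

# One sample mean per simulated experiment
mu.hats <- replicate(n.rep, mean(rnorm(n, mean = 1, sd = 1)))

se.empirical <- sd(mu.hats)   # spread of the estimator across experiments
se.theory <- 1 / sqrt(n)      # sigma / sqrt(n) with sigma = 1
```

In practice only one data set is observed, which is why an analytic result or a resampling scheme is needed to estimate the standard error.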


The Bootstrap

- Basic idea is to resample the data we have observed and compute a new value of the statistic/estimator for each resampled data set.
- Then one can assess the estimator by looking at the empirical distribution across the resampled data sets.

### Bootstrap standard error of the mean
set.seed(100)
x<-rnorm(15)
mu.hat<-mean(x)
sigma.hat<-sd(x)
B<-100
mu.hatNew<-rep(0,B)
for(i in 1:B)
{
  ### Resample the observed data with replacement
  x.new<-sample(x,replace=TRUE)
  mu.hatNew[i]<-mean(x.new)
}
se<-sd(mu.hatNew)

### Same procedure, with the median as the estimator
set.seed(100)
x<-rnorm(15)
mu.hat<-mean(x)
sigma.hat<-sd(x)
B<-100
mu.hatNew<-rep(0,B)
for(i in 1:B)
{
  x.new<-sample(x,replace=TRUE)
  mu.hatNew[i]<-median(x.new)
}
se<-sd(mu.hatNew)


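The Normal Example: CI

Confidence interval for the mean $\mu$:

$$\hat\mu \pm t_{n-1,\,1-\alpha/2}\,\frac{\hat\sigma}{\sqrt{n}}$$

The quantile $t_{n-1,\,1-\alpha/2}$ depends on n, but when n is large $t_{n-1,\,1-\alpha/2} \approx z_{1-\alpha/2}$, and usually $\alpha = 0.05$, where $z_{0.975} \approx 1.96$.

Numerical example, 15 obs. from N(1,1)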

What does this mean?

> set.seed(100)
> x<-rnorm(15)
> t.test(x,mu=0)

	One Sample t-test

data:  x
t = 0.3487, df = 14, p-value = 0.7325
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.2294725  0.3185625
sample estimates:
mean of x
 0.044545
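One way to read a 95% confidence interval is through repeated experiments: about 95% of the intervals built this way contain the true mean. The sketch below first rebuilds the interval by hand from the t quantile, then estimates the coverage by simulation. The number of repetitions and the N(1,1) setting for the coverage check are assumptions made for illustration.

```r
# Confidence interval by hand, and its coverage under repeated sampling.
set.seed(100)
x <- rnorm(15)
n <- length(x)

# mean +/- t quantile * standard error
ci.manual <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)
ci.ttest <- as.numeric(t.test(x, mu = 0)$conf.int)

# Coverage: how often does the interval contain the true mean (here 1)?
covered <- replicate(2000, {
  y <- rnorm(15, mean = 1, sd = 1)
  ci <- t.test(y)$conf.int
  ci[1] <= 1 && 1 <= ci[2]
})
coverage <- mean(covered)
```

The statement is about the procedure, not about one particular interval: a given interval either contains the fixed, unknown mean or it does not.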


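The Normal Example: Testing

Test a hypothesis about the mean: $H_0: \mu = \mu_0$ against $H_1: \mu \neq \mu_0$

t-test

$$t = \frac{\hat\mu - \mu_0}{\hat\sigma/\sqrt{n}}$$

If $\mu = \mu_0$, t follows a t-distribution with n-1 degrees of freedom

p-value: $p = P(|T_{n-1}| \geq |t|)$, where $T_{n-1}$ has a t-distribution with n-1 degrees of freedom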
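The test statistic and the two-sided p-value can be computed by hand and compared with `t.test()`. A minimal sketch, assuming the same simulated sample as in the numerical example and a null value of 0:

```r
# One-sample t-test by hand (two-sided), compared with t.test().
set.seed(100)
x <- rnorm(15)
n <- length(x)
null.mean <- 0

# Distance between the estimate and the null value, in standard errors
t.stat <- (mean(x) - null.mean) / (sd(x) / sqrt(n))

# Probability of a statistic at least this extreme under the null hypothesis
p.value <- 2 * pt(-abs(t.stat), df = n - 1)

tt <- t.test(x, mu = null.mean)
stat.diff <- abs(t.stat - unname(tt$statistic))
p.diff <- abs(p.value - tt$p.value)
```

A large p-value means the data are compatible with the null value; it does not show that the null hypothesis is true.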


The Bayesian Approach


•Observations $y$ are drawn from a probability distribution $f(y \mid \theta)$, where $\theta$ is the parameter vector

•The parameters are unknown but random

•The uncertainty on the parameter vector is modeled through a prior distribution $\pi(\theta)$

A Bayesian statistical model is made of

1. A parametric statistical model $f(y \mid \theta)$

2. A prior distribution $\pi(\theta)$

Q: How can we combine the two?

A: Bayes' theorem!


The Bayesian Approach

Bayes' theorem ↔ Inversion of probability

If A and E are events such that P(E)≠0 and P(A)≠0 then P(A|E) and P(E|A) are related by

$$P(A \mid E) = \frac{P(E \mid A)\,P(A)}{P(E)}$$


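The Bayesian Approach

From prior to posterior:

$$\pi(\theta \mid y) = \frac{f(y \mid \theta)\,\pi(\theta)}{\int f(y \mid \theta)\,\pi(\theta)\,d\theta}$$

•$f(y \mid \theta)$: information on θ contained in the observation y

•$\pi(\theta)$: prior information

•$\int f(y \mid \theta)\,\pi(\theta)\,d\theta$: normalizing constant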


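The Bayesian Approach

Sequential nature of Bayes' theorem (for $y_1$ and $y_2$ independent given θ):

$$\pi(\theta \mid y_1, y_2) \propto f(y_2 \mid \theta)\,\pi(\theta \mid y_1)$$

The posterior is the new prior!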
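The sequential property can be checked numerically on a grid of parameter values, where the normalizing constant is simply a sum. The sketch below assumes a normal model with known variance 1 and an arbitrarily chosen N(0, 4) prior on the mean; the grid and the split of the data into two batches are also illustrative.

```r
# Grid approximation: updating in one step or in two steps gives the same posterior.
set.seed(100)
y <- rnorm(15, mean = 1, sd = 1)
theta <- seq(-3, 5, length.out = 2001)         # grid of candidate means

normalize <- function(w) w / sum(w)            # plays the role of the normalizing constant
lik <- function(obs) {
  sapply(theta, function(th) prod(dnorm(obs, mean = th, sd = 1)))
}

prior <- normalize(dnorm(theta, mean = 0, sd = 2))   # hypothetical prior

# All observations at once
post.batch <- normalize(lik(y) * prior)

# First 7 observations, then the remaining 8 with the first posterior as prior
post.first <- normalize(lik(y[1:7]) * prior)
post.seq <- normalize(lik(y[8:15]) * post.first)

seq.diff <- max(abs(post.batch - post.seq))
```

The two routes agree because the likelihood of independent observations factorizes into the product of the two batch likelihoods.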



Justifications:

•Updating of the information about θ by extracting the information about θ from the data

•Conditions upon the observations (likelihood principle)

•Avoids averaging over the unobserved values of y

•Provides a complete, unified inferential scope



Practical aspect:

•Calculation of the normalizing constant can be difficult

•Conjugate priors (exact calculation is possible)

•Markov chain Monte Carlo
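Markov chain Monte Carlo avoids the normalizing constant because only ratios of posterior densities are needed. The following is a minimal random-walk Metropolis sketch, not a method taken from the slides: it assumes a normal model with known variance 1, a hypothetical N(0, 4) prior on the mean, and arbitrary tuning choices (proposal standard deviation, chain length, burn-in). The closed-form normal posterior, which is available in this conjugate case, is used only as a check.

```r
# Random-walk Metropolis for a normal mean (illustrative settings throughout).
set.seed(100)
y <- rnorm(15, mean = 1, sd = 1)

# Unnormalized log posterior: log likelihood + log prior
log.post <- function(theta) {
  sum(dnorm(y, mean = theta, sd = 1, log = TRUE)) +
    dnorm(theta, mean = 0, sd = 2, log = TRUE)
}

n.iter <- 20000
chain <- numeric(n.iter)
theta.cur <- 0
lp.cur <- log.post(theta.cur)
for (i in 1:n.iter) {
  theta.prop <- theta.cur + rnorm(1, mean = 0, sd = 0.5)   # symmetric proposal
  lp.prop <- log.post(theta.prop)
  # Accept with probability min(1, posterior ratio)
  if (log(runif(1)) < lp.prop - lp.cur) {
    theta.cur <- theta.prop
    lp.cur <- lp.prop
  }
  chain[i] <- theta.cur
}
draws <- chain[-(1:2000)]   # discard burn-in

# Exact posterior for comparison: precision adds up, mean is precision-weighted
n <- length(y)
exact.var <- 1 / (n / 1 + 1 / 4)
exact.mean <- exact.var * (sum(y) / 1 + 0 / 4)
```

The same loop works when no closed form exists, since `log.post` only needs to be known up to an additive constant.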



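Conjugate priors:

Example: normal mean, one observation ($\sigma^2$ known)

$y \mid \theta \sim N(\theta, \sigma^2)$ and $\theta \sim N(m, \tau^2)$

likelihood + prior → posterior

$$\theta \mid y \sim N\left(\frac{\sigma^2 m + \tau^2 y}{\sigma^2 + \tau^2},\ \frac{\sigma^2 \tau^2}{\sigma^2 + \tau^2}\right)$$

Conjugate priors:

Example: normal mean, n observations ($\sigma^2$ known)

$y_1,\dots,y_n \mid \theta \sim N(\theta, \sigma^2)$ independently, and $\theta \sim N(m, \tau^2)$

likelihood + prior → posterior

$$\theta \mid y_1,\dots,y_n \sim N\left(\frac{(\sigma^2/n)\, m + \tau^2 \bar{y}}{\sigma^2/n + \tau^2},\ \frac{(\sigma^2/n)\, \tau^2}{\sigma^2/n + \tau^2}\right)$$

Shrinkage: the posterior mean is a weighted average that pulls the sample mean $\bar{y}$ toward the prior mean $m$.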
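The conjugate update for a normal mean with known variance can be written as a small function. This is a sketch under stated assumptions: the function name `normal.posterior`, its argument names, and the two priors used in the example are illustrative, and the data are simulated.

```r
# Posterior of a normal mean with known variance and a normal prior.
normal.posterior <- function(y, sigma2, prior.mean, prior.var) {
  n <- length(y)
  # Weight given to the sample mean: data precision over total precision
  w <- (n / sigma2) / (n / sigma2 + 1 / prior.var)
  list(mean = w * mean(y) + (1 - w) * prior.mean,
       var = 1 / (n / sigma2 + 1 / prior.var),
       weight = w)
}

set.seed(100)
y <- rnorm(15, mean = 1, sd = 1)

# A vague prior barely moves the estimate; a tight prior shrinks it toward 0
post.vague <- normal.posterior(y, sigma2 = 1, prior.mean = 0, prior.var = 100)
post.tight <- normal.posterior(y, sigma2 = 1, prior.mean = 0, prior.var = 0.1)
```

The weight on the sample mean grows with n, so the influence of the prior fades as more observations are collected.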



[Figures: standardized likelihood for the N(1,1) data, shown with the prior and the resulting posterior (two examples).]



Criticism of the Bayesian choice:

•Many!

•Subjectivity of the prior (most critical)

•The prior distribution is the key to Bayesian inference



Response:

•Prior information is (almost) always available

•There is no such thing as the prior distribution

•The prior is a tool summarizing available information as well as the uncertainty related to this information

•The use of your prior is OK as long as you can justify it



Bayesian statistics and Bioinformatics

•Make the best of available prior information

•Unified framework

•The prior information can be used to regularize noisy estimates (few replicates)

•Computationally demanding?
