computational genomics 10810 guy zinmanepxing/class/10810/lecture/recitation7.pdf · cdnaand...

Computational Genomics 10810

Microarray normalization

Guy Zinman

Different types of microarrays

Method of generating probes | Single-channel | Dual-channel
cDNA library | | cDNA spotted arrays (Stanford)
Synthesized | Affymetrix (25-mer), NimbleGen (50-75-mer), Agilent (60-mer), Illumina (50-mer) | NimbleGen, Agilent, Illumina

(Columns give the # of samples hybridized.)

cDNA microarrays

[Figure: cDNA microarray workflow. Labels: known DNA sequences, glass slide, cells of interest, reference sample, isolate mRNA, fluorescently labeled.]

cDNA microarrays

• cDNA microarray experiments typically involve hybridising two mRNA samples, each of which has been converted into cDNA and labelled with its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5), and a green fluorescent dye, Cyanine 3 (Cy3)), on a single glass slide that has been spotted with (several thousands of) cDNA probes.

• Because of competitive binding between the two samples, the ratio of the red and green fluorescence intensities for each spot is indicative of the relative abundance, in the two samples, of the transcript matching that spot's DNA probe. Thus data from cDNA microarrays only provide information on the relative expression of the genes.

Scanning

[Figure: the hybridized microarray is excited by laser 1 and laser 2; the emission is recorded as monochrome pictures, which are combined (two-color arrays).]

Image Segmentation

[Figure: segmentation software turns the scanned image into numerical data.]

Oligonucleotide (synthesized) arrays

• The probes are short sequences designed to match parts of the sequence of known or predicted open reading frames.

• Oligonucleotide arrays are produced by printing short oligonucleotide sequences designed to represent a single gene or family of gene splice-variants, by synthesizing this sequence directly onto the array surface.

• Instead of one or several cDNA clones per gene, an oligo-microarray contains two sets of probes for each gene, arranged as probe pairs.

• One set consists of ‘perfect match’ (PM) oligonucleotides that are designed on non-conserved regions in a gene.

• The other set contains ‘mismatch’ (MM) oligonucleotides that are identical to the ‘perfect match’ oligonucleotides except for the nucleotide in the middle of the sequence.

• In the presence of a specific RNA in the hybridization solution, the perfect match probes will hybridize more strongly on average than their mismatch partners.

• This assumption is used to determine the presence or absence of this specific RNA.

• For each probeset, the value that is usually taken as representative for the expression level of the corresponding gene (the quantitative RNA abundance) is the average difference between the perfect match probes and their mismatch partners (PM minus MM).

• In addition, the mismatch oligonucleotides are used to estimate cross-hybridisation and local background signals.
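The average-difference index described above can be sketched in a few lines. This is a minimal illustration only; the function name and the intensities are made up for the example and do not come from any vendor software.

```python
def average_difference(pm, mm):
    """Mean of PM - MM over the probe pairs of one probeset."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and the same length")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)

# Hypothetical intensities for one probeset with four probe pairs.
pm = [1200.0, 950.0, 1430.0, 800.0]
mm = [300.0, 410.0, 380.0, 350.0]
expression = average_difference(pm, mm)
```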

Other applications of microarray technology (besides measuring gene expression)

• DNA copy number analysis

• SNP analysis

• ChIP-chip (interaction data)

• Competitive growth assays

• …

The DNA Array Analysis Pipeline

[Figure: flowchart of the pipeline. Stages shown: Question, Experimental Design, Array design, Probe design, Buy Chip/Array, Sample Preparation, Hybridization, Image analysis, Normalization, Expression Index Calculation, Comparable Gene Expression Data, Statistical Analysis, Fit to Model (time series), Advanced Data Analysis (Clustering, PCA, Classification, Promoter Analysis, Meta analysis, Survival analysis, Regulatory Network).]

Data Normalisation

Technique | Choices | Aim (Real) | Aim (Ideal)
Transformation “To Near Normality” | Log2, Lin-Log | Numerically tractable | Gaussian
Normalisation “Location” | Location parameter: 1. Mean, 2. Median, 3. Regression(s) (LOWESS) | Account for systematic effects | Gaussian
Standardisation “Scale” | Scale parameter | Stabilise variance | Gaussian

Data Normalisation

Transformation …to near normality

Raw Data…exponential-like

Log2 Transformed…normal-like

Armidale Animal Breeding Summer Course, UNE, Feb. 2006

The Starting Point: The Ratio (2-color arrays)

[Figure: Ratio histogram. x axis: Ratio (0 to 2); y axis: Frequency.]

Log ratios treat up- and down-regulated genes equally (two-color arrays)

[Figure: Log(ratio) histogram. x axis: Log(ratio) (-2 to 2); y axis: Frequency.]

log2(1) = 0, log2(2) = 1, log2(1/2) = -1
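The symmetry of log ratios can be checked directly. The sketch below uses made-up channel intensities; `log2_ratio` is a helper name chosen for this example.

```python
import math

def log2_ratio(red, green):
    """log2 of the red/green intensity ratio for one spot."""
    return math.log2(red / green)

two_fold_up = log2_ratio(200.0, 100.0)    # ratio 2
two_fold_down = log2_ratio(100.0, 200.0)  # ratio 1/2
unchanged = log2_ratio(150.0, 150.0)      # ratio 1
```

On the raw ratio scale the two-fold changes sit at 2 and 0.5, at unequal distances from 1; on the log2 scale they sit at equal distances from 0.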

Two kinds of variation in the signal

• Biological variation
  • Effect
  • Noise

• Gene-specific variation
  • Spotting (size and shape)
  • Cross-hybridization
  • Dye

• Global variation
  • RNA quality
  • Sample preparation
  • Dye
  • Hybridization
  • Photodetection

(Figure labels: Systematic, Stochastic)

Sources of Non-Biological Variation

• Dye bias: differences in heat and light sensitivity, efficiency of dye incorporation

• Differences in the amount of labeled cDNA hybridized to each channel in a microarray experiment (here channel is used to refer to a particular slide/dye combination.)

• Variation across replicate slides

• Variation across hybridization conditions

• Variation in scanning conditions

• Variation among technicians doing the lab work

• etc.

Spatial variations

• Regional variations

• Up (red) and down (green) regulated genes should be randomly distributed across the slide (but they’re not)

[Figure: slide image with a “Green corner!”]

Sources of variation

Global variation (Systematic) | Gene-specific variation (Stochastic)
Similar effect on many measurements | Too random to be explicitly accounted for
Corrections can be estimated from data | “noise”
Normalization | Statistical testing

What is Normalization?

• Calibration = Normalization = Scaling

• Normalization describes the process of removing (or minimizing) non-biological variation in measured signal intensity levels so that biological differences in gene expression can be appropriately detected.

• Typically normalization attempts to remove global effects, i.e., effects that can be seen by examining plots that show all the data for a slide or slides.

Normalization - two problems

I. How do we detect biases? Which genes should we use for estimating biases among chips/channels?

II. How do we remove the biases?

Which Genes to use for bias detection?

1. All genes on the chip

• Assumption: Most of the genes are equally expressed in the compared samples; the proportion of differential genes is low (<20%).

• Limits:
  • Not appropriate when comparing highly heterogeneous samples (different tissues)
  • Not appropriate for analysis of ‘dedicated chips’ (apoptosis chips, inflammation chips etc)

Which Genes to use for bias detection?

2. Housekeeping genes

• Assumption: based on prior knowledge, a set of genes can be regarded as equally expressed in the compared samples

• In other words, one or more genes are assumed to be expressed at a constant rate.

• Affy novel chips: ‘normalization set’ of 100 genes

• NHGRI’s cDNA microarrays: set of 70 "house-keeping" genes

• Limits:
  • The validity of the assumption is questionable
  • Housekeeping genes are usually expressed at high levels, so they are not informative for the low intensities range

Which Genes to use for bias detection?

3. Spiked-in controls from another organism, over a range of concentrations

• Limits:
  • Low number of controls, so less robust
  • Can’t detect biases due to differences in RNA extraction protocols

4. “Invariant set”

• Trying to identify genes that are expressed at similar levels in the compared samples without relying on any prior knowledge:
  • Rank the genes in each chip according to their expression level
  • Find genes with small change in ranks

Normalization methods

1. Global normalization (Scaling)

• A single normalization factor (k) is computed for balancing chips/channels:

Xi_norm = k*Xi, or log2(R/G) → log2(R/G) – c (2-color)

• Multiplying intensities by this factor equalizes the mean (median) intensity among compared chips

• Assumption: Total RNA (mass) used is the same for both samples.

• So, averaged across thousands of genes, total hybridization should be the same for both samples.

Global Normalization (2-color)

[Figure: before and after global normalization.]

Linear normalization – doesn’t work well
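As a rough sketch of global normalization on a two-color slide, the constant c can be taken as the median log ratio, so that the normalized log ratios are centered at 0. Using the median rather than the mean is a choice made here for illustration, and the intensities are invented.

```python
import math
from statistics import median

def global_normalize(red, green):
    """Return log2(R/G) - c, with c the median log ratio on the slide."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    c = median(m)
    return [v - c for v in m]

# Hypothetical slide where the red channel is about 4x brighter overall.
red = [400.0, 800.0, 1600.0, 200.0, 3200.0]
green = [100.0, 200.0, 400.0, 100.0, 400.0]
normalized = global_normalize(red, green)
```

Because one constant is subtracted from every spot, the shape of the M distribution is unchanged; only its location moves.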

2. Quantile normalization

• A quantile has a fixed fraction of genes below it in intensity. The first quantile could have 1% of the genes below it in intensity, and so on.

• Assumption: the intensity distribution is the same on every chip, so the entire distribution is equalized

• Sort intensities in each chip

• Compute mean intensity in each rank across the chips

• Replace each intensity by the mean intensity at its rank

[Figure: Chip #1, Chip #2 and Chip #3 are combined into an “average chip”.]
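The three steps (sort, average per rank, substitute) can be sketched as follows. The function name and the toy chips are assumptions for illustration; ties are broken by position here, whereas production implementations usually average tied ranks.

```python
def quantile_normalize(chips):
    """chips: list of equal-length intensity lists, one per chip."""
    n = len(chips[0])
    if any(len(c) != n for c in chips):
        raise ValueError("all chips must have the same number of probes")
    # Step 1: sort intensities within each chip.
    sorted_chips = [sorted(c) for c in chips]
    # Step 2: mean intensity at each rank across chips (the "average chip").
    rank_means = [sum(sc[r] for sc in sorted_chips) / len(chips) for r in range(n)]
    # Step 3: replace each intensity by the mean intensity at its rank.
    normalized = []
    for c in chips:
        order = sorted(range(n), key=lambda i: c[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = rank_means[rank]
        normalized.append(out)
    return normalized

chips = [[5.0, 2.0, 3.0, 4.0],
         [4.0, 1.0, 4.5, 2.0],
         [3.0, 4.0, 6.0, 8.0]]
norm = quantile_normalize(chips)
```

After normalization every chip contains exactly the same set of values, only assigned to different probes.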

Side-by-side boxplots show examples of variation across channels.

[Figure: side-by-side boxplots of the Cy3 and Cy5 channels for each slide. Each box marks the minimum, Q1 (25th percentile), median, Q3 (75th percentile) and maximum.]

[Figure: Log Mean Signal Centered at 0.]

3. Intensity-dependent normalization

• Compensate for intensity-dependent biases

• M vs A plots (also called R-I plot)

• X axis: A – average intensity

A = 0.5*log(Cy3*Cy5)

• Y axis: M – log ratio

M = log(Cy3/Cy5)

Intensity-dependent bias

[Figure: M vs A plot with M = log(Cy3/Cy5) on the y axis (M>0: Cy3>Cy5; M<0: Cy3<Cy5) and A on the x axis, from low intensities to high intensities.]

* Global normalization cannot remove intensity-dependent biases

We expect the M vs A plot to look like:

[Figure: M = log(Cy3/Cy5) plotted against A.]
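The M and A coordinates can be computed per spot as below. The slides write log without a base; base 2 is assumed here, and the intensities are invented.

```python
import math

def ma_values(cy3, cy5):
    """Return (M, A) lists with M = log2(Cy3/Cy5), A = 0.5*log2(Cy3*Cy5)."""
    m = [math.log2(g / r) for g, r in zip(cy3, cy5)]
    a = [0.5 * math.log2(g * r) for g, r in zip(cy3, cy5)]
    return m, a

# Two hypothetical spots with the same average intensity and opposite ratios.
cy3 = [256.0, 64.0]
cy5 = [64.0, 256.0]
m, a = ma_values(cy3, cy5)
```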

The Qspline method (Workman et al. 2002)

• From the empirical distribution, a number of quantiles are calculated for each of the channels to be normalized (one channel shown in red) and for the reference distribution (shown in black)

• A QQ-plot is made and a normalization curve is constructed by fitting a cubic spline function

• As reference one can use an artificial “median array” for a set of arrays, or use a log-normal distribution, which is a good approximation.

The Qspline method (Workman et al. 2002)

• Assumes overall intensity between two arrays should be comparable (majority of genes assumed constant). This means that the quantiles of the distributions, plotted in a quantile-quantile plot, should lie on the diagonal. If they do not, they form a normalization curve that is signal dependent.

• Even if the assumption does not hold, and less than a majority of genes are constant, the normalization still works provided that the numbers of up-regulated and down-regulated genes are roughly equal.

dChip (Li & Wong)

• Assumption: a subset of genes is assumed constant.

• The constant control genes are not known a priori but are instead identified as genes whose intensity rank does not differ by more than a threshold value between two arrays.

• This invariant set is defined iteratively and used to draw a normalization curve that is signal dependent.

• In practice it works well.

(Li and Wong, 2001)
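A single pass of the rank-based selection might look like the sketch below. It is a simplification and not dChip's implementation: a fixed rank threshold is assumed, and the iterative refinement and curve fitting are left out. All names and values are hypothetical.

```python
def ranks(values):
    """Rank of each value within its array (0 = lowest)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def invariant_set(array_a, array_b, max_rank_diff):
    """Indices of genes whose rank differs by at most max_rank_diff."""
    ra, rb = ranks(array_a), ranks(array_b)
    return [i for i in range(len(array_a)) if abs(ra[i] - rb[i]) <= max_rank_diff]

a = [10.0, 50.0, 200.0, 900.0, 30.0]
b = [12.0, 400.0, 210.0, 950.0, 35.0]
stable = invariant_set(a, b, max_rank_diff=0)
```

The selected genes would then serve as the points through which a signal-dependent normalization curve is drawn.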

RMA

• Robust Multi-array Average (RMA) expression measure (Irizarry et al., Biostatistics, 2003) (ignores MM probes)

• Assumes all chips have the same background and distribution of values.

• Fit this additive model by iteratively re-weighted least-squares or median polish.

RMA Model

From a presentation by Ben Bolstad: http://bmbolstad.com/talks/Bolstad-%20AffymetrixTalkAug072006.pdf

RMA: Fitting Model

• Median polish (figure)

• Robust regression
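Median polish fits an additive model (overall level + probe effect + array effect) by alternately sweeping out row and column medians. The sketch below is a generic Tukey median polish on a small made-up probes-by-arrays table of log2 intensities; it illustrates the fitting step only and is not the RMA code itself.

```python
from statistics import median

def median_polish(table, n_iter=10):
    """table: rows = probes, columns = arrays (log2 intensities).
    Returns (overall, row_effects, col_effects, residuals)."""
    rows, cols = len(table), len(table[0])
    resid = [row[:] for row in table]
    overall = 0.0
    row_eff = [0.0] * rows
    col_eff = [0.0] * cols
    for _ in range(n_iter):
        # Sweep row (probe) medians out of the residuals.
        for i in range(rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep column (array) medians out of the residuals.
        for j in range(cols):
            m = median(resid[i][j] for i in range(rows))
            col_eff[j] += m
            for i in range(rows):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return overall, row_eff, col_eff, resid

# Hypothetical probeset: 3 probes x 3 arrays, exactly additive.
table = [[8.0, 9.0, 10.0],
         [9.0, 10.0, 11.0],
         [7.0, 8.0, 9.0]]
overall, row_eff, col_eff, resid = median_polish(table)
expression = [overall + c for c in col_eff]  # one value per array
```

The per-array expression value is the overall level plus that array's column effect; using medians keeps a single outlying probe from dominating the estimate.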


RMA: One Drawback


MAS5

• (now GCOS/GDAS) by Affymetrix

• Unlike RMA, compares PM and MM probes

• MM* is an adjusted MM that is never bigger than PM

• Signal = TukeyBiweight{log(PMj - MM*j)}

• Tukey biweight is a robust average procedure with weights and outlier rejection

• Robust measure of center, like median or trimmed mean
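A one-step Tukey biweight average can be sketched as follows. The tuning constant c = 5 and the small epsilon are assumptions of this sketch, as are the input values, which stand in for log(PM - MM*) of one probeset.

```python
from statistics import median

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight average (robust mean)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    num = den = 0.0
    for v in values:
        # Distance from the median in units of c * MAD.
        u = (v - med) / (c * mad + eps)
        # Bisquare weight; points with |u| >= 1 are rejected as outliers.
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        num += w * v
        den += w
    return num / den

# Hypothetical log-scale probe values with one outlier.
values = [8.0, 8.2, 7.9, 8.1, 15.0]
signal = tukey_biweight(values)
```

The outlier at 15.0 receives zero weight, so the result stays within the range of the four consistent probes.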

MAS5: p-value and calls

• First calculate the discriminant for each probe pair:

R=(PM-MM)/(PM+MM)

• A one-sided Wilcoxon signed-rank test is used to compare R vs τ and determine the p-value

• Present/Marginal/Absent calls are thresholded
  • Present: p-value ≤ α1
  • Marginal: α1 < p-value < α2
  • Absent: p-value ≥ α2

• Default: α1 = 0.04, α2 = 0.06, τ=0.015
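The discriminant and the thresholding step can be sketched as below. The Wilcoxon test itself is not implemented; the p-value is taken as an input. Function names are chosen for this example, and the thresholds follow the defaults listed above.

```python
def discrimination_scores(pm, mm):
    """R = (PM - MM) / (PM + MM) for each probe pair."""
    return [(p - m) / (p + m) for p, m in zip(pm, mm)]

def detection_call(p_value, alpha1=0.04, alpha2=0.06):
    """Map a detection p-value to a Present/Marginal/Absent call."""
    if p_value <= alpha1:
        return "Present"
    if p_value < alpha2:
        return "Marginal"
    return "Absent"

scores = discrimination_scores([300.0, 500.0], [100.0, 500.0])
calls = [detection_call(p) for p in (0.01, 0.05, 0.2)]
```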

M*A Plot of MAS5

MAS5: Summary

• Good:
  • Usable with single chips (though replicates are preferable)
  • Gives a p-value for expression data

• Bad:
  • Lots of fudge factors in the algorithm
  • M*A plots can be better

• Misc:
  • Commonly used processing method for Affy chips
  • Highly dependent on Mismatch probes

Methods compared on expression variance

[Figure: standard deviation of gene measures from 20 replicate arrays, plotted against expression level. RMA: blue and red; MAS5: green; dChip: black. From Terry Speed.]

LOWESS (Locally Weighted Scatterplot Smoothing)

• Local linear regression model

• Tri-cube weight function

• Least Squares

[Figure: estimated values of log2(Cy5/Cy3) as function of log10(Cy3*Cy5).]

Location Normalisation

LOWESS = LOcally WEighted regression and Smoothing Scatterplots

[Figure: normalised intensities. Source: G Rosa 2003.]
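The three ingredients (local linear model, tri-cube weights, least squares) combine as in the sketch below. It is a bare-bones version written for illustration: no robustness iterations, a nearest-neighbor bandwidth set by a hypothetical `frac` parameter, and the fit evaluated only at the observed points. Normalization subtracts the fitted trend of M on A from M.

```python
def lowess(x, y, frac=0.5):
    """Locally weighted linear fit of y on x, evaluated at each x[i]."""
    n = len(x)
    k = max(2, int(round(frac * n)))  # number of points in each local window
    fitted = []
    for x0 in x:
        # Bandwidth: distance to the k-th nearest point.
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12
        # Tri-cube weights, zero at and beyond the window edge.
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        # Weighted least squares for a local straight line.
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def lowess_normalize(m, a, frac=0.5):
    """Subtract the intensity-dependent trend of M on A from M."""
    trend = lowess(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]

# Hypothetical slide with a purely intensity-dependent bias in M.
a = [float(i) for i in range(10)]
m = [0.3 * ai - 1.0 for ai in a]
m_norm = lowess_normalize(m, a)
```

Unlike global normalization, the correction here differs from spot to spot depending on A.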

4. Logit normalization (Lemon et al. 2003)

$$ \mathrm{logit}(y) = \log\left(\frac{y - \text{background}}{\text{saturation} - y}\right) $$

• Background is calculated as the minimum intensity measured on the chip minus 0.1% of the signal intensity range: min-0.001*(max-min). The result is Z-transformed to mean zero and sd one.

• Performance is surprisingly good.
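A minimal sketch of the logit transform followed by the Z-transform is given below. The slide text defines only the background; taking saturation as the maximum plus 0.1% of the range is an assumption made here by symmetry, and the sample standard deviation is used for the Z-transform. The intensities are invented.

```python
import math
from statistics import mean, stdev

def logit_normalize(intensities, margin=0.001):
    """Logit-transform one chip's intensities, then Z-transform."""
    lo, hi = min(intensities), max(intensities)
    span = hi - lo
    background = lo - margin * span
    saturation = hi + margin * span  # assumption: mirrors the background rule
    logits = [math.log((y - background) / (saturation - y)) for y in intensities]
    mu, sd = mean(logits), stdev(logits)
    return [(v - mu) / sd for v in logits]

intensities = [120.0, 340.0, 5600.0, 980.0, 45000.0]
z = logit_normalize(intensities)
```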

5. Spike-in controls

• If none of the assumptions seems applicable to the experiment, add a spiked control to the mRNA preparation.

• The idea is to measure the amount of mRNA or total RNA extracted from the cell, and then add a known transcript of known concentration to the pool.

• This spiked transcript is assumed to be amplified and labeled the same way as the other transcripts and detected with a unique probe on the array.

• The spiked transcript must not match any gene in the RNA preparation. (For a human preparation, use E. coli.)

• After scanning, you multiply all measurements on one array by a factor chosen so that the spiked control matches that on the other array.

Spike-in controls

• This approach has the limitation that it results in a linear normalization that does not correct signal-dependent bias (unless you use many spiked control genes with different concentrations).

• Finally, it is only as accurate as the measurement of the total amount of RNA and the addition of an exact amount of spiked transcript.

Summary for normalization

• Systematic biases exist in microarray data

• Normalization can remove (reduce) these biases

• There are many different ways to normalize data
  • Global median, dChip, MAS 5.0, RMA, LOWESS, etc

• Each normalization method makes different assumptions regarding the data.

• BUT: don’t expect it to fix bad data!
  • Won’t make up for lack of replicates
  • Won’t make up for horrible slides

The missing values problem

• Microarrays can have systematic or random missing values due to experimental artifacts/problems

• Some algorithms aren’t robust to missing values

• How do we “guess” at these values?

Why Estimate Missing Values?

[Figure: three panels. Complete data set; data set with 30% entries missing (missing values appear black); data set with missing values estimated by KNNimpute algorithm.]

KNNimpute Algorithm

• Idea: use genes with similar expression profiles to estimate missing values

• Replaces NaNs in Data with a weighted mean of the k nearest-neighbor columns. The weights are inversely proportional to the distances from the neighboring columns.

Before imputation (row labels on the slide: Gene X, Gene B, Gene C; j marks the column with the missing value):

2 | 4 | 5 | 7 | 3 | 2
2 |   | 5 | 7 | 3 | 1
3 | 5 | 6 | 7 | 3 | 2

After imputation:

2 | 4 | 5 | 7 | 3 | 2
2 |4.3| 5 | 7 | 3 | 1
3 | 5 | 6 | 7 | 3 | 2

Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB. Missing value estimation methods for DNA microarrays. Bioinformatics. 2001 Jun;17(6):520-5.
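The idea can be sketched on the three profiles from the example. This is an illustrative simplification, not the published KNNimpute code: Euclidean distance over the observed entries and inverse-distance weights are assumed, and the function name is hypothetical.

```python
import math

def knn_impute_value(target, neighbors, j, k=2):
    """Estimate target[j] (missing, None) from the k nearest complete rows."""
    def distance(row):
        # Euclidean distance over entries observed in the target row.
        return math.sqrt(sum((row[c] - target[c]) ** 2
                             for c in range(len(target)) if target[c] is not None))
    nearest = sorted(neighbors, key=distance)[:k]
    # Weights inversely proportional to distance (guard against zero distance).
    weights = [1.0 / max(distance(row), 1e-12) for row in nearest]
    return sum(w * row[j] for w, row in zip(weights, nearest)) / sum(weights)

incomplete = [2, None, 5, 7, 3, 1]
complete_rows = [[2, 4, 5, 7, 3, 2],
                 [3, 5, 6, 7, 3, 2]]
estimate = knn_impute_value(incomplete, complete_rows, j=1)
```

The estimate falls between the two neighbors' values at position j (4 and 5), closer to the value of the nearer profile.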

References

• Li and Wong (2001). Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biology 2:1–11.

• Irizarry, Bolstad, Collin, Cope, Hobbs and Speed (2003). Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research 31(4):e15.

• Affymetrix. Affymetrix Microarray Suite User Guide. Affymetrix, Santa Clara, CA, version 5 edition, 2001.

• Gautier, Cope, Bolstad and Irizarry (2003). affy - an R package for the analysis of Affymetrix GeneChip data at the probe level. Bioinformatics

2. Identification of Differential Genes

Identification of differential genes

• The most basic experimental design: comparison between 2 conditions, ‘treatment’ vs control

• The goal: to identify genes that are differentially expressed in the examined conditions

• Number of replicates is usually low (n=2-4)

1. Fold Change

• Consider genes whose mean expression level changed by at least 1.75-2 fold as differential genes

• Limits:
  • Usually no estimation of false positive rate is provided
  • Biased to genes with low expression level
  • Ignores the variability of gene levels over replicates.

Fold Change limit – ignores variability over replicates

(C1-C3: control replicates; t1-t3: treatment replicates)

gene | C1 | C2 | C3 | mean_c | t1 | t2 | t3 | mean_t
g1 | 90 | 100 | 110 | 100 | 190 | 200 | 210 | 200
g2 | 50 | 100 | 150 | 100 | 100 | 150 | 350 | 200

• Seek a score that ‘punishes’ genes with high variability over replicates

2. T-test

• Compute a t-score for each gene

$$ t = \frac{m_t - m_c}{\sqrt{S_c^2 / n_c + S_t^2 / n_t}} $$

mc, mt – mean levels in Control and Treatment

Sc², St² – variance estimates in Control and Treatment

nc, nt – number of replicates in Control and Treatment

(C1-C3: control replicates; t1-t3: treatment replicates)

gene | C1 | C2 | C3 | mean_c | t1 | t2 | t3 | mean_t | t
g1 | 90 | 100 | 110 | 100 | 190 | 200 | 210 | 200 | 12
g2 | 50 | 100 | 150 | 100 | 100 | 150 | 350 | 200 | 1.2

T-test

• t-scores can be associated with a p-value (under the assumption that expression levels follow a normal distribution)

• Log-transformation

• Set cut-off for p-value (α=0.01)

• Consider all genes with p-value < α as differential genes

gene | C1 | C2 | C3 | mean_c | t1 | t2 | t3 | mean_t | t | p-val
g1 | 90 | 100 | 110 | 100 | 190 | 200 | 210 | 200 | 12.2 | 0.0001
g2 | 50 | 100 | 150 | 100 | 100 | 150 | 350 | 200 | 1.2 | 0.14
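The t-scores for the two example genes can be recomputed with the standard library. The sketch assumes sample variances and the unpooled standard error (with equal group sizes the pooled form gives the same value); p-values are not computed here.

```python
import math
from statistics import mean, variance

def t_score(control, treatment):
    """Two-sample t-score using sample variances of each group."""
    mc, mt = mean(control), mean(treatment)
    se = math.sqrt(variance(control) / len(control)
                   + variance(treatment) / len(treatment))
    return (mt - mc) / se

def fold_change(control, treatment):
    return mean(treatment) / mean(control)

g1 = ([90, 100, 110], [190, 200, 210])
g2 = ([50, 100, 150], [100, 150, 350])
t_g1, t_g2 = t_score(*g1), t_score(*g2)
fc_g1, fc_g2 = fold_change(*g1), fold_change(*g2)
```

Both genes have the same two-fold change, but the t-score of g2 is about ten times smaller because its replicates vary much more.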

Multiple Testing

• P-valg associated with the t-score Tg is the probability of obtaining by chance a t-score that is at least as extreme as Tg.

• Multiplicity problem: thousands of genes are tested simultaneously.

• e.g. suppose:
  • 10,000 genes on a chip
  • If your p-value cutoff is 0.05, then you expect 5% type I errors.
  • 10,000 × 0.05 = 500 genes expected to be called DE just by chance.
  • This is not acceptable!

Multiple testing

• Need to adjust for multiple testing when assessing the statistical significance of findings

• Corrections:

Bonferroni

• The p-value of each gene is multiplied by the number of genes in the gene list. If the corrected p-value is still below the error rate, the gene will be significant:

• Corrected P-value = p-value * n (number of genes in test) < 0.05
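A minimal sketch of the Bonferroni correction follows. Capping the corrected value at 1 is a common convention added here, and the p-values are made up.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (corrected p-value, significant?) for each gene."""
    n = len(p_values)
    corrected = [min(1.0, p * n) for p in p_values]
    return [(c, c < alpha) for c in corrected]

p_values = [0.00001, 0.01, 0.04, 0.2]
results = bonferroni(p_values)
significant = [flag for _, flag in results]
```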

Holm’s correction (Bonferroni Step-down)

• This correction is very similar to the Bonferroni, but a little less stringent:

• 1) The p-values of the genes are ranked from the smallest to the largest.

• 2) The first p-value is multiplied by the number of genes present in the gene list; if the end value is less than 0.05, the gene is significant:

• Corrected P-value = p-value * n < 0.05

• 3) The second p-value is multiplied by the number of genes less 1:

• Corrected P-value = p-value * (n-1) < 0.05

• and so on down the list, stopping at the first gene that fails.

Holm’s correction

Example: Let n=1000, error rate=0.05
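The step-down procedure can be sketched as below, with invented p-values. Stopping at the first p-value that fails is the standard step-down rule and is assumed here.

```python
def holm(p_values, alpha=0.05):
    """Return a significance flag per gene using Holm's step-down rule."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    significant = [False] * n
    for rank, i in enumerate(order):
        # The (rank+1)-th smallest p-value is multiplied by (n - rank).
        if p_values[i] * (n - rank) < alpha:
            significant[i] = True
        else:
            break  # step-down: stop at the first failure
    return significant

flags_a = holm([0.03, 0.001, 0.02, 0.012])
flags_b = holm([0.001, 0.02, 0.03, 0.04])
```

In the first list all four genes pass, whereas plain Bonferroni (multiplying every p-value by 4) would reject only two of them.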

Westfall and Young Permutation (minP)

• The Westfall and Young permutation follows a step-down procedure similar to the Holm method, combined with a bootstrapping method to compute the p-value distribution:

• 1) P-values are calculated for each gene based on the original data set and ranked.

• 2) The permutation method creates a pseudo-data set by dividing the data into artificial treatment and control groups.

• 3) P-values for all genes are computed on the pseudo-data set.

• 4) The successive minima of the new p-values are retained and compared to the original ones.

• 5) This process is repeated a large number of times, and the proportion of resampled data sets where the minimum pseudo-p-value is less than the original p-value is the adjusted p-value

Benjamini, Hochberg (1995)

• Controlling the FDR at a specified level:

• After ranking the genes according to their significance (P-values) and starting at the top of the list, you accept all genes where:

• P <= (i/m)*q

• (where i is the rank of the gene in the list, i.e. the number accepted so far, m is the total number of genes tested, and q is the desired FDR.)

• For i > 1 this correction is less strict than Bonferroni.

Benjamini Hochberg

• Let n=1000, error rate=0.05
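A sketch of the Benjamini-Hochberg rule is shown below. It uses the usual step-up form (find the largest rank i with P <= (i/m)*q and accept every gene up to that rank), which is an assumption of this sketch and can accept more genes than stopping at the first failure. The p-values are invented.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return an accepted flag per gene, controlling the FDR at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank i (1-based) with p_(i) <= (i / m) * q.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            cutoff = rank
    accepted = [False] * m
    for i in order[:cutoff]:
        accepted[i] = True
    return accepted

accepted_a = benjamini_hochberg([0.001, 0.02, 0.03, 0.04])
accepted_b = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```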

Permutation tests

• FDR can also be assessed by permutation.

• If you permute the measurements from the sample and control and repeat the t-test for all genes, you get an estimate for the number of type-I errors that can be expected at a chosen cutoff in significance.

• Dividing this number by the number of genes that pass the t-test on the unpermuted data, you get the FDR (the approach of SAM).

Differential Genes - Tools

• Cyber-T

• SAM (Significance Analysis of Microarrays) (Tusher et al. 2001).

Testing for Differential Expression: Significance Analysis of Microarrays (SAM)

• Two-group comparison: t-test with shrinkage

$$ d = \frac{\bar{Y}_1 - \bar{Y}_2}{s + s_0} $$

• s0
  • shrinks t

• H0
  • Empirical Bayes like
  • Multiple testing
  • Smooth distribution of test statistic
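The effect of the constant s0 can be seen in a small sketch. The gene-wise s is taken here as the pooled standard error of the difference in means, and s0 is passed in as a given number rather than estimated; the function name and data are illustrative.

```python
import math
from statistics import mean, variance

def sam_d(group1, group2, s0):
    """d = (mean1 - mean2) / (s + s0), s = pooled standard error."""
    n1, n2 = len(group1), len(group2)
    pooled = ((n1 - 1) * variance(group1)
              + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    s = math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))
    return (mean(group1) - mean(group2)) / (s + s0)

# Hypothetical low-variance gene: a tiny s inflates the ordinary t-score.
low1, low2 = [11.0, 11.1, 10.9], [10.0, 10.1, 9.9]
d_plain = sam_d(low1, low2, s0=0.0)   # ordinary t-score
d_shrunk = sam_d(low1, low2, s0=1.0)  # shrunk towards zero
```

With s0 = 0 the statistic is the usual t-score; a positive s0 damps scores that are large only because the gene's variance estimate happens to be very small.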

SAM: Computation of s0

1. sα = α percentile of the si (s for the ith gene)

2. Compute the 100 quantiles of the si values → q1 < q2 . . . < q100

3. Let α be a value in (0, .05, .10 . . . 1.0), with

$$ d_i^{\alpha} = \frac{\bar{Y}_1 - \bar{Y}_2}{s_i + s_\alpha} $$

a) Compute the median absolute deviation from the median, divided by .64, of the di’s of genes whose si lies between adjacent quantiles → vj

b) Compute cv(α) = coefficient of variation of the vj

4. Choose the α with the smallest cv → s0 = sα

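SAM: Multiclass Comparison

$$ r_i = \left[ \frac{\sum_k n_k}{\prod_k n_k} \sum_{k=1}^{K} n_k \left( \bar{x}_{ik} - \bar{x}_i \right)^2 \right]^{1/2} $$

$$ s_i = \left[ \frac{1}{\sum_k (n_k - 1)} \left( \sum_k \frac{1}{n_k} \right) \sum_{k=1}^{K} \sum_{j \in C_k} \left( x_{ij} - \bar{x}_{ik} \right)^2 \right]^{1/2} $$

$$ d_i = \frac{r_i}{s_i + s_0} $$

• Shrunken square root of F from one-way ANOVA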

SAM: Testing for Differential Expression

• Permute class labels with balance

• Rank all observed di’s

• Rank all di’s for each permutation separately

• Average permuted d’s at given quantile

• Compare observed quantile d to average of permutations

Obs | P1 | P2 | … | Perm. Avg.
2.23 | 1.85 | 1.92 | … | 1.94
2.11 | 1.78 | 1.90 | … | 1.93
1.91 | 1.74 | 1.83 | … | 1.76
… | … | … | … | …

SAM: Differential Expression & FDR

• Choose Δ → “significant” genes

• Asymmetric cutoff for + & - d’s

• Find smallest observed +d and least negative -d

• Each permutation → count number of +d’s ≥ critical obs +d and number of –d’s ≤ critical obs –d

• FDR = π0(Average # of False Discoveries)/(# of discoveries)

• π0 = Prior prob. of not differentially expressed = 1 – proportion differentially expressed
