bioinformatics expression profiling and functional genomics part i: preprocessing ad 29/10/2006

Bioinformatics

Expression profiling and functional genomics

Part I: Preprocessing

Ad 29/10/2006

• http://www.esat.kuleuven.ac.be/~kmarchal/
• Course material: course notes + powerpoint files
• Exercises



Overview

MICROARRAY PREPROCESSING

• Gene expression

• Omics era

• Transcript profiling

• Experiment design

• Preprocessing

• Exercises

[Figure: DNA → (transcription, starting at +1) → mRNA → (translation) → protein]

Gene expression

Adaptation of cell to its environment

[Figure: bacterial cell receiving Signal 1 and Signal 2 from outside; inside, an FNR box precedes the genes cytN, cytO, cytQ and cytP]

Adaptation of a cell:
– response to environmental signals
– response to e.g. hormones (cell differentiation)

Cellular response determined by the genes which are switched on upon a signal

Gene expression

The action of genetic networks underlies the observed phenotypic behavior

Gene expression

Overview


• Gene expression

• Omics era



• Preprocessing

• Exercises

Functional genomics

Structural Genomics

Comparative Genomics

Traditional molecular biology
– Directed toward understanding the role of a particular gene or protein in a molecular biological process
– Northern analysis
– Mutational analysis
– Expression by reporter fusions

Omics era: measurement of the expression of thousands of genes or proteins simultaneously

Omics era

– The function or the expression of a gene in a global context of the cell

– Holistic approaches allow better understanding of fundamental molecular biological processes

Because a gene does not act on its own, it is always embedded in a larger network (systems biology)

transcriptomics

[Figure: reference sample and test sample, RNA → cDNA → detection; DNA → mRNA → protein]

Omics era

proteomics

[Figure: DNA → mRNA → protein]

Omics era

metabolomics

Omics era

SYSTEMS BIOLOGY

Consider the cell as a system

Omics era

SYSTEMS BIOLOGY

Mechanistic insight in the biological system at molecular biological level

High throughput data

Omics era

• Analysis of such large-scale data is no longer trivial => computational challenges
– Low signal/noise
– High dimensionality

• Simple spreadsheet analyses (e.g. in Excel) are no longer sufficient

• More advanced datamining procedures become necessary

• Another urgent problem is also how to store and organize all the information.

Bioinformatics

Omics era

Overview


• Gene expression

• Omics era

• Transcript profiling
– Principle of microarray
– Applications


• Preprocessing

• Exercises



[Figure: transcriptomics, RNA → cDNA for both samples]

Transcript profiling

• Previously: measure expression level of one gene: Northern blot analysis

• Novel techniques: measure expression level of all genes simultaneously => EXPRESSION PROFILING

Principle: hybridisation

mRNA: 5’ –UGACCUGACG- 3’

cDNA 3’ -ACTGGACTGC-5’

Hybridize : stick together
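The base pairing behind hybridisation can be checked with a few lines of Python. This is only an illustrative sketch: the function names `complementary_cdna` and `hybridizes` are made up for this example, and real hybridisation also tolerates partial mismatches, which the sketch ignores.

```python
# Watson-Crick pairing between an mRNA base and the cDNA base it binds to.
RNA_TO_CDNA = {"A": "T", "U": "A", "G": "C", "C": "G"}


def complementary_cdna(mrna_5_to_3: str) -> str:
    """Return the complementary cDNA, written 3'->5' so it lines up base by base."""
    return "".join(RNA_TO_CDNA[base] for base in mrna_5_to_3.upper())


def hybridizes(mrna_5_to_3: str, cdna_3_to_5: str) -> bool:
    """True if the cDNA is the exact complement of the mRNA (perfect match only)."""
    return complementary_cdna(mrna_5_to_3) == cdna_3_to_5.upper()


# The pair shown on the slide: mRNA 5'-UGACCUGACG-3' and cDNA 3'-ACTGGACTGC-5'.
print(complementary_cdna("UGACCUGACG"))
print(hybridizes("UGACCUGACG", "ACTGGACTGC"))
```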


Monitor molecular activities on a global level
– protein levels (proteomics)
– enzyme activities
– metabolites
– gene expression (mRNA): transcriptomics = transcript profiling

This gives a general insight into the global behavior of the cell (holistic)

Molecular biological methods

– RT-PCR

– SAGE

– Protein arrays

– Microarray analysis



cDNA clones

Printing slides

SLIDE PRODUCTION

Experiment design

Sample preparation

Hybridization & scanning

cDNA µA EXPERIMENT

DATA ANALYSIS

EXPERIMENTAL PROCEDURES

cDNA array

Spotted cDNA on a glass slide

Upscaled Northern hybridisation


Gene (DNA)

Transcript (mRNA)

cDNA


Preparation of probes

• Collect cDNA clones

• Amplify target cDNA insert by PCR

• Check yield & specificity by electrophoresis

Spot + PCR products on glass slides




[Figure: workflow of a microarray experiment for two samples (Signal 1, Signal 2), RNA → cDNA]

1. Cell culture
2. mRNA isolation
3. Labeling
4. Hybridization + washing
5. Scanning
6. Image analysis => numerical value


http://www.bio.davidson.edu/courses/genomics/chip/chip.html


Superimposed color image

* Transform into color images

* Superimpose color images from R and G channel

good alignment bad alignment


black spots : gene was neither expressed in test nor in control sample

green : gene was only expressed in control sample

red : gene was only expressed in test sample

yellow : gene was expressed both in test and in control sample
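The color rules above amount to a small decision table on the two channel intensities. The sketch below assumes that a gene counts as "expressed" when its intensity exceeds a cutoff; the cutoff value of 100 and the name `spot_color` are hypothetical and chosen only for illustration.

```python
def spot_color(test_red: float, control_green: float, cutoff: float = 100.0) -> str:
    """Color of a spot in the superimposed image.

    test_red      -- intensity in the red channel (test sample)
    control_green -- intensity in the green channel (control sample)
    cutoff        -- hypothetical intensity above which a gene counts as expressed
    """
    red_on = test_red > cutoff
    green_on = control_green > cutoff
    if red_on and green_on:
        return "yellow"  # expressed in both samples
    if red_on:
        return "red"     # expressed only in the test sample
    if green_on:
        return "green"   # expressed only in the control sample
    return "black"       # expressed in neither sample


print(spot_color(5000.0, 4000.0))
print(spot_color(20.0, 4000.0))
```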

Superimposed color image


Signal intensity is proportional to the amount of cDNA present in the sample

signal Cy3 -> numerical value

signal Cy5 -> numerical value

Data analysis

Image analysis



Data representation

Gene profile

Experiment profile

Spotted DNA microarray High density oligonucleotide array


Overview


• Gene expression

• Omics era



• Preprocessing

• Exercises

The mathematical approach depends on the experimental design

• Comparison of 2 samples (black/white)

• Comparison of multiple arrays

• Global dynamic profiling

• Static experiment: Comparison of samples (mutants, patients)

Experiment Design

Type1: Comparison of 2 samples

Statistical testing

Control sample

Induced sample

Retrieve statistically over or under expressed genes

2 sample design

Experiment Design

black/white experiment description (array V mice genes)

• Condition 1: pygmy mouse, 10 days old (test)
• Condition 2: normal mouse, 10 days old (ref)

detect differentially expressed genes

Experiment design (Latin Square)

Condition 1, Dye 1, Replica L

Condition 1, Dye 1, Replica R

Condition 2, Dye 2, Replica L






Array 1

Array 2

Per gene, per condition 4 measurements available

Experiment Design

Measure expression of all genes

• During time (dynamic profile)

• In different conditions

Identify coexpressed genes

Identify mechanism of coregulation

Motif Finding

Clustering

Multiple array design

Experiment Design

Original dataset : 6178 genes

Preprocessing:
• select 4634 most variable (25 % most variable)
• variance normalized
• adaptive quality based clustering (32 clusters) (95%)
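The first two preprocessing steps (selecting the most variable genes and variance normalization) could be sketched as follows. This assumes each gene profile is a list of expression values over the time points and that "variance normalized" means rescaling each profile to mean 0 and standard deviation 1; the function names and the toy data are hypothetical.

```python
from statistics import mean, stdev, variance


def select_most_variable(profiles: dict, n_keep: int) -> list:
    """Return the ids of the n_keep genes with the largest variance over their profile."""
    ranked = sorted(profiles, key=lambda gene: variance(profiles[gene]), reverse=True)
    return ranked[:n_keep]


def variance_normalize(profile: list) -> list:
    """Rescale one gene profile to mean 0 and standard deviation 1."""
    m, s = mean(profile), stdev(profile)
    if s == 0:
        return [0.0 for _ in profile]  # a flat profile carries no shape information
    return [(value - m) / s for value in profile]


# Toy profiles (made-up numbers, three time points per gene).
toy = {"geneA": [1.0, 1.0, 1.0], "geneB": [0.0, 5.0, 10.0], "geneC": [1.0, 2.0, 3.0]}
kept = select_most_variable(toy, 2)
print(kept)
print({gene: variance_normalize(toy[gene]) for gene in kept})
```

After this step genes with the same profile shape but different amplitudes become comparable, which is what a clustering step needs.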

Multiple array design
• Study of the mitotic cell cycle of Saccharomyces cerevisiae with oligonucleotide arrays (Cho et al. 1999) - 15 time points (E=18)
• Time points 90 & 100 min deleted (Zhang et al. 1999, Tavazoie et al., 1999)

Experiment Design

• Reference: unsynchronized cells
• Condition: synchronized cells during the cell cycle at distinct time intervals

[Figure: reference design. Each array pairs one of Condition 1, 2, 3, 4, … (Dye 1, Replica L) with Condition 19 (Dye 2, Replica L).]

Reference design: e.g. Spellman dataset

Experiment Design

Loop design

Experiment Design

Overview


• Gene expression

• Omics era



• Preprocessing

– Sources of Variation

– General normalization steps

– Slide by slide normalization

– ANOVA normalization

Sources of variation
– Overshine effects
– Dye effect
– Spot effects
– Array effect

Consistent errors

• Consistent errors complicate direct comparison of measurements of the same gene/condition

• Consistent errors need to be removed by preprocessing/normalization

Preprocessing

• Tedious
• Influences downstream measurements





numerical value


Preprocessing

Dye effect

Dye, condition effect: within slide variation

Measurement error:
– Preparation of mRNA
– Labeling & reverse transcription

Normalization

Global normalization assumption

Overall signal in one channel more pronounced than in other channel

Expected on average: log2(test/ref) = 0

Preprocessing





numerical value


Preprocessing

Array effect

• normalization within slide

• ratio

Differences in global intensity between slides

Comparison between slides impossible

Array effects: between slide variation

Preprocessing

Hybridization differences

Array effects: Between slide variation

[Figure: box plots (min value, Q1, Q3, max value) of Series 1 to 4, showing differences in global intensity between slides]

Preprocessing

Measurement error: Different quantity of DNA in spot

Difference in duplicate spots

Ratio: compare differential expression between genes

Spot effect

Absolute levels between genes incomparable

Gene 1: test: 4 ref:2 R/G:2

Gene 2: test: 8 ref:4 R/G:2

Pin main effects: spot effects

Preprocessing

Non-specific Cy5 or Cy3 signal resulting from overshining (= emission from neighboring spots)

Overshine effects: within slide variation

Preprocessing

Background intensity increases with the intensity of the neighboring spots

Removing sources of variation is obligatory step

• To make comparisons within a slide possible
– E.g. find differentially expressed genes

• To allow interslide comparisons
– E.g. combining the replicas of the original experiment and the color flip

Preprocessing

Overview

MICROARRAY PREPROCESSING

• Gene expression

• Omics era



• Preprocessing– Sources of Variation



– ANOVA normalization

[Diagram: two preprocessing routes]

ANOVA based: filtering, linearisation, bootstrapping, log transformation

Array by array approach: filtering, normalization, ratio, test statistic (T-test), log transformation

Preprocessing

Background correction

• Background correction compensates for overshining
• Background correction is considered additive
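Because the background is treated as additive, the correction is a subtraction per spot. The sketch below assumes one local background estimate per spot and clips negative results to zero; the clipping and the name `background_correct` are choices made for this example, not something the slides prescribe.

```python
def background_correct(foreground: list, background: list) -> list:
    """Subtract the local background from each spot intensity (additive model).

    Negative results are clipped to 0 here, which is only one possible convention;
    such spots would normally be flagged and handled in the filtering step.
    """
    return [max(fg - bg, 0.0) for fg, bg in zip(foreground, background)]


# Three spots with the same local background of 100 (made-up numbers).
print(background_correct([500.0, 120.0, 80.0], [100.0, 100.0, 100.0]))
```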

Preprocessing: Background correction

Background correction


Preprocessing


• Additive error: the absolute level of the error is independent of the measured intensity (high relative error at low expression levels, low relative error at high expression levels).

• Multiplicative error: the absolute error increases with the measured intensity (the relative error stays roughly constant, so the absolute error is large at high levels)

Multiplicative error

Preprocessing: log transformation

LOG2 transformed intensity values: Multiplicative effects removed, additive effects more pronounced

residuals are constant at high intensities

Additive error: error increases as the signal is lower (intuitively plausible)


• Ratio (test/ref):
– test > ref: upregulation, range 1…+infinity
– test < ref: downregulation, range 0…1 (range of downregulation squashed)

• log2(test/ref) = log2(test) - log2(ref):
– upregulation, range 0…+infinity
– downregulation, range 0…-infinity

2-fold overexpression: ratio = 2, log2(ratio) = 1

2-fold underexpression: ratio = 0.5, log2(ratio) = -1

Why log2
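A short numerical check makes the symmetry argument concrete. The helper name `log2_ratio` is made up for this sketch, and the intensities are the small illustrative numbers used earlier for Gene 1 and Gene 2.

```python
import math


def log2_ratio(test: float, ref: float) -> float:
    """log2(test/ref), computed as log2(test) - log2(ref)."""
    return math.log2(test) - math.log2(ref)


# 2-fold over- and underexpression are symmetric around 0 on the log2 scale,
# while the raw ratios 2 and 0.5 are not symmetric around 1.
print(log2_ratio(8, 4), log2_ratio(4, 8))

# Gene 1 (test 4, ref 2) and Gene 2 (test 8, ref 4) differ in absolute level
# but show the same differential expression.
print(log2_ratio(4, 2), log2_ratio(8, 4))
```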



Preprocessing


• Spots are identified by image analysis
– Array Vision
– ImaGene
– Matarray

Spot detection and signal acquisition

E.g. the signal is defined as the mean pixel intensity of all pixels in a spot for which the intensity is higher than the local background + 2 SD
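The example definition above can be written down directly. The sketch assumes that "SD" refers to the standard deviation of the local background pixels, which the slide does not state explicitly; `spot_signal` and the pixel values are hypothetical.

```python
from statistics import mean, stdev


def spot_signal(spot_pixels: list, background_pixels: list, n_sd: float = 2.0) -> float:
    """Mean intensity of the spot pixels that exceed local background + n_sd * SD.

    Assumption: the SD is taken over the local background pixels.
    Returns 0.0 when no pixel passes the threshold.
    """
    threshold = mean(background_pixels) + n_sd * stdev(background_pixels)
    above = [pixel for pixel in spot_pixels if pixel > threshold]
    return mean(above) if above else 0.0


# Background mean is 10 with a small spread, so only the two bright pixels count.
print(spot_signal([12, 20, 30, 13], [10, 12, 8, 10]))
```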

• Spots can have different qualities
– Irregular spots
– Spots with an excessively large diameter
– Spots which are extremely small

artifacts

Preprocessing: filtering

Red: > 0.1 stdev

Green: > 1 stdev

Blue: > 2 stdev


Filtering:

Zero values: treat these separately, because the ratio and the log transformation are undefined for them

Zero values are interesting in a black/white experiment: genes off in condition 1 versus on in condition 2


• Some genes are only labeled with the green dye, not with the red dye
• If no mRNA of a gene is present, does the green dye bind non-specifically to the spot?

color flip essential to eliminate false positives

Seemingly underexpressed

Experiment 1 (rood = red, groen = green; L, R = replica L, replica R)

cloneId  LroodItest  RroodItest  LgroenIref   RgroenIref
26635    2106.563    0           101692.979   10399.822
27141    836.407     0           123838.567   45432.931
27500    803.205     0           111507.935   72379.887
28152    0           1331.273    9263.894     14005.905
28333    0           1255.175    87102.68     9188.587
28756    363.247     0           115771.253   88541.343
30694    924.256     0           22029.599    50306.219

Experiment 2 (color flip)

cloneId  LgroenItest  RgroenItest  LroodIref  RroodIref
26635    14376.307    12190.883    0          995.694
27141    14804.307    13242.277    1315.193   762.172
27500    22051.507    18835.761    0          0
28152    29270.26     26939.077    90.713     3402.73
28333    25964.137    22326.256    0          0
28756    14270.607    20442.069    0          1007.763
30694    20150.615    19003.462    4750.326   7988.791



• Gene expression

• Omics era



• Preprocessing– Sources of Variation



– ANOVA normalization

Overview


Preprocessing


• On average ratio red/green should be 1

– Rescale based on average of housekeeping genes

– Rescale based on spikes

– Rescale based on average expression value of the full array (global normalization)

• Methods used for normalization

– linear normalization

– Intensity dependent normalization
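Global normalization, the third rescaling option above, can be sketched on the log scale: if the average red/green ratio should be 1, the average log2 ratio should be 0. The example assumes background-corrected, non-zero intensities and uses the mean as the "average" (the median is a common, more robust alternative); `global_normalize` is a name invented for this sketch.

```python
import math
from statistics import mean


def global_normalize(red: list, green: list) -> list:
    """Return log2(R/G) per spot, shifted so that the array-wide mean is 0."""
    log_ratios = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    shift = mean(log_ratios)  # overall dye/array bias under the global assumption
    return [value - shift for value in log_ratios]


# Made-up array where the red channel is systematically brighter.
normalized = global_normalize([200.0, 400.0, 800.0], [100.0, 100.0, 100.0])
print(normalized)
```

Rescaling on housekeeping genes or spikes follows the same pattern, with the shift computed from that subset of spots only.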

Preprocessing: normalization

Linear Normalization

[Figure: scatter plots of R against G, with the distribution of log2(ratio) relative to 0]

– Red and green related by a constant factor
– Calculate factor by linear regression

• Linear normalization factor determined by linear regression

• Filtering to remove outliers in the non-linear range (green values)

• http://afgc.stanford.edu/~finkel/talk.htm


Linear normalization not straightforward,…

[Figure: log2(R/G) plotted against (log2(R) + log2(G))/2, with a linear fit and a lowess fit]


Non-linear intensity dependent normalization

Lowess (Dudoit et al., 2000) : genes seemingly underexpressed due to specific dye effect will be compensated for

Log R and log G recalculated based on the lowess fit

Lowess linearizes and normalizes the data !!!!!
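The idea of intensity dependent normalization can be illustrated with a simplified, pure-Python local regression. This is a sketch only: it fits a tricube-weighted local line of M = log2(R/G) on A = (log2(R) + log2(G))/2 without the robustness iterations of a full lowess, and the function names and the `frac` default are assumptions made for this example.

```python
import math


def lowess_fit(x: list, y: list, frac: float = 0.5) -> list:
    """Local linear fit of y on x with tricube weights (no robustness iterations)."""
    n = len(x)
    k = max(2, math.ceil(frac * n))  # number of neighbours used per local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def lowess_normalize(red: list, green: list, frac: float = 0.5) -> list:
    """Return log2(R/G) per spot after subtracting the intensity dependent trend."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]        # log ratio
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(red, green)]  # mean log intensity
    trend = lowess_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]


# Made-up data with a dye bias that grows with intensity.
green = [2.0 ** g for g in range(4, 14)]
red = [2.0 ** (1.2 * g + 0.3) for g in range(4, 14)]
print(lowess_normalize(red, green))
```

In this toy data every spot differs only through the dye bias, so the normalized log ratios come out close to 0. The corrected log R and log G can then be recalculated from the normalized log ratio and the unchanged mean log intensity.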


Intensity dependent normalization


Result of the normalization

[Figure: box plots (min value, Q1, Q3, max value) of the ratios per array. A. Before normalization (Series 1 to 4). B. After normalization (RATIO1_NORM, RATIO2_NORM, RATIO3_NORM).]



Preprocessing


• Compensates for spot effects

• Choice of the reference is important

– Intuitive reference (e.g. first time point, uninduced sample): intuitive interpretation possible, but the ratio is often undefined

– Independent reference (reference design, e.g. tissue mixture): the ratio is defined, but interpretation is complicated

Preprocessing: ratio

• Ratio (R/G):
– R > G: upregulation, range 1…+infinity
– R < G: downregulation, range 0…1 (range of downregulation squashed)

• Log ratio:
– upregulation, range 0…+infinity
– downregulation, range 0…-infinity

2-fold overexpression: ratio = 2, log2(ratio) = 1

2-fold underexpression: ratio = 0.5, log2(ratio) = -1

Preprocessing: ratio


Preprocessing


Overview further analysis

[Diagram: raw data → preprocessing → preprocessed data; preprocessed data → test statistic → differentially expressed genes; preprocessed data → clustering → clusters of coexpressed genes]



Preprocessing

I. MAIN EFFECTS + EFFECT OF INTEREST

• Overall mean
• Array effect (hybridisation efficiency)
• Condition effect (mRNA isolation efficiency)
• Gene effect: constitutive level of the gene
• GC effect: differential expression due to the altered variety
• Dye effect (labeling efficiency)

$$y_{ijnmk} = \mu + G_i + C_j + A_n + D_m + (GC)_{ij} + \varepsilon_{ijnmk}$$

Model the expression level of each gene as a combination of the different factors

Least squares fit:

• subject to restrictions

• contrast of interest: estimate (GC)i1 – (GC)i2

MultiFactor, Linear, fixed levels

Preprocessing: ANOVA

Assumption:

Independent, additive error ε ~ F, where F is a distribution with mean 0 and variance σ²


Plot the residuals (y estimated - y measured) against the estimated intensity



Analysis of variance shows relative contribution of each of the effects


Explains the relative contribution of each of these effects


Advantages:

• Gains more information from fewer observations => derives variation from all measurements made (fewer replicas required, e.g. the array effect is based on N-1 gene measurements)

• Statistical testing: the estimated error can be used for bootstrapping to estimate confidence levels

• No ratios required

Requirements:

• Requires knowledge about the experimental effects
• The model used implies that all effects and combinations of effects are linear
• Bootstrapping: residuals should be normally distributed around zero with constant variance



Estimate error

Simulate new datasets based on estimated error (3000 times)

Calculate factor of interest (GC effect) for each bootstrapped dataset (recalculate ANOVA)

Calculate CI on (GC1-GC2) of N genes based on 3000 bootstraps

Use this interval to test for significant genes
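The bootstrap steps above can be sketched for a single gene. To stay short, the example replaces the full multi-factor ANOVA with a deliberately reduced model (log intensity = condition mean + error), so the least squares fit is just the mean per condition; this simplification, the function names and the toy numbers are all assumptions of the sketch.

```python
import random
from statistics import mean


def bootstrap_ci(cond1: list, cond2: list, n_boot: int = 3000,
                 level: float = 0.99, seed: int = 0) -> tuple:
    """Residual-bootstrap confidence interval on the difference between two conditions."""
    rng = random.Random(seed)
    fit1, fit2 = mean(cond1), mean(cond2)  # least squares estimates in this reduced model
    residuals = [y - fit1 for y in cond1] + [y - fit2 for y in cond2]  # estimated error
    diffs = []
    for _ in range(n_boot):
        # Simulated dataset: fitted value plus a residual drawn with replacement.
        boot1 = [fit1 + rng.choice(residuals) for _ in cond1]
        boot2 = [fit2 + rng.choice(residuals) for _ in cond2]
        diffs.append(mean(boot1) - mean(boot2))  # re-estimate the effect of interest
    diffs.sort()
    tail = (1 - level) / 2
    lower = diffs[int(tail * n_boot)]
    upper = diffs[min(n_boot - 1, int((1 - tail) * n_boot))]
    return lower, upper


def is_significant(ci: tuple) -> bool:
    """A gene is called significant when the interval does not contain 0."""
    lower, upper = ci
    return lower > 0 or upper < 0


# Four log2 measurements per condition, as in the Latin square design (made-up values).
ci = bootstrap_ci([10.1, 9.9, 10.2, 9.8], [7.0, 7.2, 6.9, 7.1])
print(ci, is_significant(ci))
```

In the full analysis the residuals come from the ANOVA fit over all genes, and the model is refitted on every simulated dataset.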

$$y^{boot}_{ijnmk} = \hat{y}_{ijnm} + \varepsilon^{*}_{ijnmk}$$

[Figure: bootstrap distribution of GC1 - GC2 relative to 0]

ANOVA Bootstrap analysis


DATA
• Filtered for zero values
• Set 1: unnormalised data

MODELS (Kerr et al. 2000, 2001)
• Model 1 (no spot effects)
• Model 2 (spot effects independent)
• Model 3 (spot effects dependent)

MODELS
• GC effects are not confounded with the spot effects
• The type of model does influence the residual error, and therefore the bootstrap interval

More arrays simultaneously

Preprocessing


• Overall mean
• Array effect (hybridisation efficiency)
• Condition effect (mRNA isolation efficiency)
• Gene effect: constitutive level of the gene
• GC effect: differential expression due to the altered variety
• Dye effect (labeling efficiency)



Least squares fit:
• subject to restrictions
• contrast of interest: estimate (VG)k1g – (VG)k2g
• Usual confidence intervals based on normal theory are not appropriate

Bootstrap analysis of residuals avoids making distributional assumptions about the error

Assumption:

Independent, additive error ε ~ F, where F is a distribution with mean 0 and variance σ²



[Figure: four panels plotted against ŷ: TEST, ARRAY 1; REFERENCE, ARRAY 1; REFERENCE, ARRAY 2; TEST, ARRAY 2]

More arrays simultaneously

Additive error and non linear effects undermine application of ANOVA

Preprocessing

[Figure: four panels plotted against ŷ: TEST, ARRAY 1; REFERENCE, ARRAY 1; REFERENCE, ARRAY 2; TEST, ARRAY 2]

Lowess

$$y_{ijnmk} = \mu + G_i + C_j + A_n + D_m + (AG)_{ni} + (RG)_{ki} + (GC)_{ij} + \varepsilon_{ijnmk}$$

99 % confidence interval based on 100 genes, 3000 bootstraps

retained 370 genes (62 T-test p value < 0.01)

Bootstrap analysis

ID    Rat_1     Rat_2     Rat_3     Rat_4     p         D_GC_effects
285   -3.31674  -3.20904  -2.08115  -1.62183  0.008818  -2.577397
1076  -1.39327  -2.04573  -1.85822  -2.42609  0.002899  -2.175438
3755  -0.81029  -1.50631  -0.99613  -1.40283  0.005643  -1.245061
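One plausible reading of the p column is a one-sample t-test of the four log ratios of a gene against 0; that reading is an assumption, not something the slide states. The sketch below computes only the t statistic for gene 285 (the p-value would additionally need the t distribution with n-1 degrees of freedom, which the Python standard library does not provide); `one_sample_t` is a hypothetical helper name.

```python
import math
from statistics import mean, stdev


def one_sample_t(log_ratios: list) -> float:
    """t statistic for the null hypothesis that the mean log ratio equals 0."""
    n = len(log_ratios)
    standard_error = stdev(log_ratios) / math.sqrt(n)
    return mean(log_ratios) / standard_error


# Rat_1 .. Rat_4 of gene 285 from the table.
t_285 = one_sample_t([-3.31674, -3.20904, -2.08115, -1.62183])
print(round(t_285, 2))
```

A strongly negative t statistic means the four log ratios agree that the gene is underexpressed in the test sample.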

Preprocessing

Methods tested on the pygmy dataset (3750 genes)

1. ANOVA 99 % CI

2. ANOVA 95 % CI

3. SAM

4. T-test

5. Fold test

Retained 360 genes

Construct for each gene a binary profile 1 1 1 1 1

Hierarchically cluster genes based on this profile

Comparison of methods

Only 8 genes retained by all methods

• Latin Square (mouse data set)

• Reference: normal mouse
• Condition: pygmy mouse
• Two experiments: C=1 and C=2 reflect two sample time points
• 2 batches: not all genes of the genome are on one array

Array  Condition  Batch  Test  Ref
A 1    C 1        B1     R     G
A 2    C 1        B1     G     R
A 5    C 2        B1     R     G
A 6    C 2        B1     G     R
A 3    C 1        B2     R     G
A 4    C 1        B2     R     G
A 7    C 2        B2     R     G
A 8    C 2        B2     G     R

Transcript profiling Experiment Design
