bioinformatics expression profiling and functional genomics part i: preprocessing ad 29/10/2006
Bioinformatics
Expression profiling and functional genomics
Part I: Preprocessing
Ad 29/10/2006
• http://www.esat.kuleuven.ac.be/~kmarchal/
• Course material: course notes + powerpoint files
• Exercises
Overview
MICROARRAY PREPROCESSING
• Gene expression
• Omics era
• Transcript profiling
• Experiment design
• Preprocessing
• Exercises
[Figure: DNA => mRNA (transcription, start site +1) => protein (translation)]
Gene expression
Adaptation of cell to its environment
[Figure: bacterial cell (in/out) receiving Signal 1 and Signal 2; FNR box with the genes cytN, cytO, cytQ and cytP]
Adaptation of a cell:
• response to environmental signals
• response to e.g. hormones (cell differentiation)
Cellular response is determined by the genes that are switched on upon a signal
Gene expression
The action of genetic networks underlies the observed phenotypic behavior
Gene expression
Functional genomics
Structural Genomics
Comparative Genomics
Traditional molecular biology
– Directed toward understanding the role of a particular gene or protein in a molecular biological process
– Northern analysis
– Mutational analysis
– Expression by reporter fusions
Omics era
– Measurement of the expression of 1000s of genes or proteins simultaneously
Omics era
– The function or the expression of a gene in a global context of the cell
– Holistic approaches allow better understanding of fundamental molecular biological processes
Because a gene does not act on its own, it is always embedded in a larger network (systems biology)
[Figure: reference sample and test sample: RNA => cDNA => detection]
transcriptomics
[Figure: DNA => mRNA (transcription, +1) => protein (translation)]
Omics era
proteomics
[Figure: DNA => mRNA (transcription, +1) => protein (translation)]
Omics era
metabolomics
Omics era
SYSTEMS BIOLOGY
Consider the cell as a system
Omics era
SYSTEMS BIOLOGY
Mechanistic insight in the biological system at molecular biological level
High throughput data
Omics era
• Analysis of such large scale data is no longer trivial => computational challenges
– Low signal/noise
– High dimensionality
• Simple spreadsheet analysis (e.g. Excel) is no longer sufficient
• More advanced datamining procedures become necessary
• Another urgent problem is how to store and organize all the information
Bioinformatics
Omics era
Overview
MICROARRAY PREPROCESSING
• Gene expression
• Omics era
• Transcript profiling
– Principle of microarray
– Applications
• Experiment design
• Preprocessing
• Exercises
Transcript profiling
• Previously: measure expression level of one gene: Northern blot analysis
• Novel techniques: measure expression level of all genes simultaneously => EXPRESSION PROFILING
Principle: hybridisation
mRNA: 5’-UGACCUGACG-3’
cDNA: 3’-ACTGGACTGC-5’
Hybridize: stick together
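The base pairing in the example above can be checked with a few lines of Python. This is only a sketch: the function name `cdna_for` is hypothetical, and the strand is returned in 3'->5' orientation so that it lines up position by position with the mRNA written 5'->3'.

```python
# Watson-Crick pairing between an mRNA base and the complementary DNA base.
PAIR = {"A": "T", "U": "A", "G": "C", "C": "G"}

def cdna_for(mrna: str) -> str:
    """Return the complementary DNA strand, written 3'->5' (aligned with the mRNA)."""
    return "".join(PAIR[base] for base in mrna.upper())

print(cdna_for("UGACCUGACG"))  # ACTGGACTGC
```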
Transcript profiling
Monitor molecular activities on a global level
– protein levels (proteomics)
– enzyme activities
– metabolites
– gene expression (mRNA): transcriptomics = transcript profiling
This gives a general insight into the global cell behavior (holistic)
Molecular biological methods
– RT-PCR
– SAGE
– Protein arrays
– Microarray analysis
Transcript profiling
Transcript profiling
EXPERIMENTAL PROCEDURES
[Figure: workflow of a cDNA microarray (µA) experiment.
SLIDE PRODUCTION: cDNA clones, printing slides.
cDNA µA EXPERIMENT: experiment design, sample preparation, hybridization & scanning.
DATA ANALYSIS.]
cDNA array
Spotted cDNA on a glass slide
Upscaled Northern hybridisation
[Figure: gene (DNA) => transcript (mRNA, start site +1) => cDNA]
Transcript profiling
Preparation of probes
• Collect cDNA clones
• Amplify target cDNA insert by PCR
• Check yield & specificity by electrophoresis
Spot + PCR products on glass slides
Transcript profiling
Transcript profiling
1. Cell culture (Signal 1, Signal 2)
2. mRNA isolation
3. Labeling
4. Hybridization + washing
5. Scanning
6. Image analysis => numerical value
Transcript profiling
http://www.bio.davidson.edu/courses/genomics/chip/chip.html
Transcript profiling
Superimposed color image
* Transform into color images
* Superimpose color images from R and G channel
good alignment bad alignment
Transcript profiling
black spots : gene was neither expressed in test nor in control sample
green : gene was only expressed in control sample
red : gene was only expressed in test sample
yellow : gene was expressed both in test and in control sample
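The colour rules above amount to a two-channel on/off decision. A minimal sketch, assuming the test sample is in the red channel and the control sample in the green channel (as in the list), and using a hypothetical `threshold` below which a channel counts as "not expressed":

```python
def spot_colour(test: float, control: float, threshold: float = 0.0) -> str:
    """Colour of a spot in the superimposed image (test = red, control = green)."""
    test_on = test > threshold
    control_on = control > threshold
    if test_on and control_on:
        return "yellow"   # expressed in both samples
    if test_on:
        return "red"      # expressed only in the test sample
    if control_on:
        return "green"    # expressed only in the control sample
    return "black"        # expressed in neither sample
```

In a real image the colour is a continuous mixture of both channels; the hard threshold is only there to mirror the four cases of the slide.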
Superimposed color image
Transcript profiling
Signal intensity is proportional to the amount of cDNA present in the sample
• signal cy3 -> numerical value
• signal cy5 -> numerical value
Data analysis
Image analysis
Transcript profiling
Transcript profiling
Data representation
Gene profile
Experiment profile
Spotted DNA microarray High density oligonucleotide array
Transcript profiling
The mathematical approach depends on the experimental design
• Comparison of 2 samples (black/white)
• Comparison of multiple arrays
• Global dynamic profiling
• Static experiment: Comparison of samples (mutants, patients)
Experiment Design
Type1: Comparison of 2 samples
Statistical testing
Control sample
Induced sample
Retrieve statistically over or under expressed genes
2 sample design
Experiment Design
black/white experiment description (array V mice genes)
• Condition 1 : pygmee mouse 10 days old (test)
• Condition 2 : normal mouse 10 days old (ref)
detect differentially expressed genes
Experiment design (Latin Square)
Array    Condition    Dye    Replica
Array 1  Condition 1  Dye 1  L
Array 1  Condition 1  Dye 1  R
Array 1  Condition 2  Dye 2  L
Array 1  Condition 2  Dye 2  R
Array 2  Condition 2  Dye 1  L
Array 2  Condition 2  Dye 1  R
Array 2  Condition 1  Dye 2  L
Array 2  Condition 1  Dye 2  R
Per gene, per condition 4 measurements available
Experiment Design
Measure expression of all genes
• During time (dynamic profile)
• In different conditions
Identify coexpressed genes
Identify mechanism of coregulation
Motif Finding
Clustering
Multiple array design
Experiment Design
Multiple array design
• Study of the mitotic cell cycle of Saccharomyces cerevisiae with oligonucleotide arrays (Cho et al. 1999) - 15 time points (E=18)
• time points 90 & 100 min deleted (Zhang et al. 1999, Tavazoie et al., 1999)
Original dataset: 6178 genes
Preprocessing:
• select the 4634 most variable genes (the 25 % least variable are dropped)
• variance normalized
• adaptive quality based clustering (32 clusters) (95%)
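The first two preprocessing steps above (selecting the most variable genes and variance normalization) can be sketched as follows. The function names and the toy profiles are hypothetical, and "variance normalized" is read here as rescaling each gene profile to mean 0 and standard deviation 1, which is an assumption.

```python
from statistics import mean, pstdev, pvariance

def select_most_variable(profiles: dict, keep: int) -> dict:
    """Keep the `keep` gene profiles with the largest variance over the time points."""
    ranked = sorted(profiles, key=lambda g: pvariance(profiles[g]), reverse=True)
    return {g: profiles[g] for g in ranked[:keep]}

def variance_normalize(profile: list) -> list:
    """Rescale one gene profile to mean 0 and standard deviation 1."""
    m, s = mean(profile), pstdev(profile)
    if s == 0:
        return [0.0 for _ in profile]  # flat profile: nothing to rescale
    return [(x - m) / s for x in profile]

# Toy data (made up): three gene profiles over three time points.
toy = {"flat": [1.0, 1.0, 1.0], "up": [1.0, 2.0, 3.0], "big": [0.0, 5.0, 10.0]}
kept = select_most_variable(toy, keep=2)
normalized = {g: variance_normalize(p) for g, p in kept.items()}
```

After this step, genes with the same shape of profile but different amplitudes become directly comparable, which is what clustering on coexpression needs.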
Experiment Design
• Reference: unsynchronized cells
• Condition: synchronized cells during cell cycle at distinct time intervals
[Figure: reference design. Conditions 1, 2, 3, 4, … (Dye 1, Replica L) are each hybridized against the same reference, Condition 19 (Dye 2, Replica L), one pair per array.]
Reference design: e.g. Spellman dataset
Experiment Design
Loop design
Experiment Design
Overview
MICROARRAY PREPROCESSING
• Gene expression
• Omics era
• Transcript profiling
• Experiment design
• Preprocessing
– Sources of Variation
– General normalization steps
– Slide by slide normalization
– ANOVA normalization
Sources of variation
– Overshine effects
– Dye effect
– Spot effects
– Array effect
Consistent errors
• Consistent errors complicate direct comparison of measurements of the same gene/condition
• Consistent errors need to be removed by preprocessing/normalization
Preprocessing
• Tedious
• Influences downstream measurements
Preprocessing
Dye effect
Dye, condition effect: within slide variation
Measurement error:
– Preparation mRNA
– Labeling & reverse transcription
Normalization
Global normalization assumption
Overall signal in one channel more pronounced than in other channel
$\log_2(\mathrm{test}/\mathrm{ref}) = 0$ on average over all genes
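Under the global normalization assumption, the average log ratio of a slide should be zero, so any overall offset is attributed to the dye. A minimal sketch, assuming strictly positive intensities and using the mean as the centre (the median is a common alternative); the function name is hypothetical:

```python
import math
from statistics import mean

def global_normalize(test, ref):
    """Return log2(test/ref) per gene, shifted so that the slide average is 0."""
    log_ratios = [math.log2(t / r) for t, r in zip(test, ref)]
    shift = mean(log_ratios)  # overall dye offset of this slide
    return [m - shift for m in log_ratios]

# Toy slide (made up): the test channel is systematically brighter.
normalized = global_normalize([200.0, 400.0, 800.0], [100.0, 100.0, 100.0])
```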
Preprocessing
Preprocessing
Array effect
• normalization within slide
• ratio
Differences in global intensity between slides
Comparison between slides impossible
Array effects: between slide variation
Preprocessing
Hybridization differences
Array effects: Between slide variation
[Figure: box plots (minvalue, Q1, Q3, maxvalue) of the values per slide, Series 1 to 4]
Preprocessing
Measurement error: Different quantity of DNA in spot
Difference in duplicate spots
Ratio: compare differential expression between genes
Spot effect
Absolute levels between genes incomparable
Gene 1: test: 4 ref:2 R/G:2
Gene 2: test: 8 ref:4 R/G:2
Pin main effects: spot effects
Preprocessing
Non-specific Cy5 or Cy3 signal resulting from overshining (= emission from neighboring spots)
Overshine effects: within slide variation
Preprocessing
Background intensity increases with the intensity of the neighboring spots
Removing sources of variation is an obligatory step
• To make comparisons within a slide possible, e.g. find differentially expressed genes
• To allow interslide comparisons, e.g. combining the replicates of the original experiment and the color flip
Preprocessing
[Figure: preprocessing flow chart, starting from background correction.
ANOVA based: log transformation, filtering, linearisation, bootstrapping.
Array by array approach: log transformation, filtering, normalization, ratio, test statistic (T-test).]
Preprocessing
• Background correction compensates for overshining
• Background correction is considered additive
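Because the background is considered additive, the correction is a plain subtraction per spot. A minimal sketch; the function name is hypothetical, and clipping negative results to zero is an assumption rather than something the slides prescribe:

```python
def background_correct(foreground, background):
    """Subtract the local background from each spot signal (additive model)."""
    # Clipping at 0 is an assumption: a spot dimmer than its background carries no signal.
    return [max(f - b, 0.0) for f, b in zip(foreground, background)]

corrected = background_correct([500.0, 120.0, 80.0], [100.0, 100.0, 100.0])
```

Spots that end up at zero cannot be used in a ratio or a log transformation and have to be handled in the filtering step.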
Preprocessing: Background correction
Background correction
• additive error: the absolute level of the error is independent of the measured intensity (at low expression levels high relative error, at high expression levels low relative error).
• multiplicative error: the absolute error increases with the measured intensity (the relative error stays the same, so at high levels the absolute error is high)
Multiplicative error
Preprocessing: log transformation
LOG2 transformed intensity values: multiplicative effects removed, additive effects more pronounced
• residuals are constant at high intensities
• additive error: on the log scale the error increases as the signal gets lower (intuitively plausible)
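The different behaviour of the two error types on the log scale can be made concrete with a small calculation. This is an illustration under a simple assumed error model, measured = true * (1 + multiplicative) + additive; the function name and the numbers are made up.

```python
import math

def log2_error(true_intensity, additive=0.0, multiplicative=0.0):
    """Deviation on the log2 scale caused by a given additive/multiplicative error."""
    measured = true_intensity * (1.0 + multiplicative) + additive
    return math.log2(measured) - math.log2(true_intensity)

for intensity in (100.0, 10000.0):
    print(intensity,
          round(log2_error(intensity, multiplicative=0.1), 4),  # same at every intensity
          round(log2_error(intensity, additive=50.0), 4))       # large only at low intensity
```

A 10 % multiplicative error becomes a constant offset after the log transformation, while a fixed additive error of 50 units dominates at low intensity and almost vanishes at high intensity.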
Preprocessing: log transformation
Preprocessing: log transformation
Why log2
• ratio (test/ref):
– test>ref: upregulation, range 1…+infinity
– test<ref: downregulation, range 0...1: range of downregulation squashed
• Log2(test/ref) = log2(test)-log2(ref):
– upregulation range 0…+infinity
– downregulation range 0…-infinity
• 2 fold overexpression: Ratio = 2, Log2(Ratio) = 1
• 2 fold underexpression: Ratio = 0.5, Log2(Ratio) = -1
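The symmetry argument for log2 is easy to verify numerically. A short sketch (the function name is hypothetical):

```python
import math

def log_ratio(test: float, ref: float) -> float:
    """log2(test/ref), computed as log2(test) - log2(ref)."""
    return math.log2(test) - math.log2(ref)

# Up- and downregulation of the same fold change are mirror images around 0.
for test, ref in ((8.0, 4.0), (4.0, 8.0), (16.0, 4.0), (4.0, 16.0)):
    print(test / ref, log_ratio(test, ref))
```

On the ratio scale a 4-fold decrease (0.25) and a 4-fold increase (4) are not equally far from 1, whereas on the log2 scale they are -2 and +2.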
Preprocessing: log transformation
Spot detection and signal acquisition
• Spots are identified by image analysis
– Array Vision
– ImaGene
– Matarray
• e.g. the signal is defined as the mean pixel intensity of all pixels in a spot for which the intensity is higher than the local background + 2SD
• Spots can have different qualities (artifacts)
– Irregular spots
– Spots with excessively large diameter
– Spots which are extremely small
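The example signal definition above (mean of the spot pixels brighter than local background + 2SD) can be written down directly. This sketch assumes that SD refers to the standard deviation of the local background pixels and uses the sample standard deviation; the function name and the fallback value 0.0 are hypothetical choices.

```python
from statistics import mean, stdev

def spot_signal(spot_pixels, background_pixels):
    """Mean intensity of the spot pixels that exceed local background + 2 SD."""
    threshold = mean(background_pixels) + 2 * stdev(background_pixels)
    bright = [p for p in spot_pixels if p > threshold]
    return mean(bright) if bright else 0.0  # no pixel above threshold: no signal

# Toy pixels (made up): background mean 12, SD 2, so the threshold is 16.
signal = spot_signal([15.0, 20.0, 30.0], [10.0, 12.0, 14.0])
```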
Preprocessing: filtering
[Figure legend: Red >0.1 stdev; Green >1 stdev; Blue >2 stdev]
Preprocessing: filtering
Filtering:
• Zero values: treat these separately, because the ratio and the log transformation are undefined
• Zero values are interesting in a black/white experiment: genes off in condition 1 versus on in condition 2
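One way to treat zero values separately is to split the spots before taking log ratios. A minimal sketch with hypothetical names and made-up intensities:

```python
import math

def split_zero_spots(spots: dict):
    """spots maps gene -> (test, ref). Returns (log2 ratios, spots with a zero channel)."""
    log_ratios, zero_spots = {}, {}
    for gene, (test, ref) in spots.items():
        if test > 0 and ref > 0:
            log_ratios[gene] = math.log2(test / ref)
        else:
            zero_spots[gene] = (test, ref)  # ratio or log undefined: keep aside
    return log_ratios, zero_spots

ratios, zeros = split_zero_spots({"g1": (400.0, 100.0), "g2": (0.0, 250.0), "g3": (300.0, 0.0)})
```

The genes set aside are not discarded: in a black/white experiment a gene that is off in one condition and on in the other is exactly the kind of candidate one is looking for.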
Preprocessing: filtering
• Some genes are only labeled with the green dye, not with the red dye
• If no mRNA of a gene is present, does the green dye bind aspecifically to a spot?
• Such genes are seemingly underexpressed
• Color flip is essential to eliminate false positives
exp1 (rood = red, groen = green; test labeled red, ref labeled green)
cloneId  LroodItest  RroodItest  LgroenIref  RgroenIref
26635    2106.563    0           101692.979  10399.822
27141    836.407     0           123838.567  45432.931
27500    803.205     0           111507.935  72379.887
28152    0           1331.273    9263.894    14005.905
28333    0           1255.175    87102.68    9188.587
28756    363.247     0           115771.253  88541.343
30694    924.256     0           22029.599   50306.219

exp2 (color flip; test labeled green, ref labeled red)
cloneId  LgroenItest  RgroenItest  LroodIref  RroodIref
26635    14376.307    12190.883    0          995.694
27141    14804.307    13242.277    1315.193   762.172
27500    22051.507    18835.761    0          0
28152    29270.26     26939.077    90.713     3402.73
28333    25964.137    22326.256    0          0
28756    14270.607    20442.069    0          1007.763
30694    20150.615    19003.462    4750.326   7988.791
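A dye-specific artifact shows up as a log ratio that changes sign when the dyes are swapped. The sketch below checks this for clone 30694, using the replicate L intensities of exp1 and exp2 listed above; the function name is hypothetical, and a sign comparison is only one simple way to read a color flip.

```python
import math

def flip_consistent(test1, ref1, test2, ref2):
    """True when log2(test/ref) has the same sign in both dye orientations."""
    m1 = math.log2(test1 / ref1)  # original experiment
    m2 = math.log2(test2 / ref2)  # color flip
    return (m1 > 0) == (m2 > 0)

# Clone 30694, replicate L: exp1 (test red, ref green), exp2 (test green, ref red).
print(flip_consistent(924.256, 22029.599, 20150.615, 4750.326))  # False
```

The gene looks underexpressed in exp1 and overexpressed in exp2, so the apparent change follows the green dye rather than the biological condition.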
Preprocessing: filtering
• On average ratio red/green should be 1
– Rescale based on average of housekeeping genes
– Rescale based on spikes
– Rescale based on average expression value of the full array (global normalization)
• Methods used for normalization
– linear normalization
– Intensity dependent normalization
Preprocessing: normalization
Linear Normalization
[Figure: scatter plots of the R channel against the G channel]
Preprocessing: normalization
– Red and green related by a constant factor
– Calculate factor by linear regression
[Figure: two panels showing Log2(ratio) relative to 0]
• Linear normalization factor determined by linear regression
• Filtering to remove outliers in the non-linear range (green values)
• http://afgc.stanford.edu/~finkel/talk.htm
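If red and green are related by a constant factor, that factor can be estimated by a linear regression and divided out. A minimal sketch, assuming a regression through the origin (R = k * G) on already filtered intensities; the function names and toy values are hypothetical.

```python
def linear_factor(red, green):
    """Least-squares slope k of the regression R = k * G through the origin."""
    return sum(r * g for r, g in zip(red, green)) / sum(g * g for g in green)

def linear_normalize(red, green):
    """Rescale the red channel so that, on average, R/G equals 1."""
    k = linear_factor(red, green)
    return [r / k for r in red]

# Toy slide (made up): red is 1.5 times brighter than green for every spot.
green = [100.0, 200.0, 400.0]
red = [150.0, 300.0, 600.0]
red_normalized = linear_normalize(red, green)
```

Because large intensities dominate the least-squares fit, outliers in the non-linear range have to be filtered out first, as noted above.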
Preprocessing: normalization
Linear normalization not straightforward,…
[Figure: Log2(R/G) plotted against (Log2(R) + Log2(G))/2, with a linear fit and a lowess fit]
Preprocessing: normalization
Non-linear intensity dependent normalization
Lowess (Dudoit et al., 2000): genes that seem underexpressed because of a specific dye effect are compensated for
Log R and log G are recalculated based on the lowess fit
Lowess both linearizes and normalizes the data
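The idea of intensity-dependent normalization can be sketched with a tricube-weighted local linear fit of M = log2(R) - log2(G) on A = (log2(R) + log2(G))/2. This is the core ingredient of lowess without its robustness iterations, so it is a simplified stand-in and not the exact procedure of Dudoit et al.; the function names and the `frac` parameter are assumptions.

```python
def tricube(u: float) -> float:
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def local_fit(a, m, frac=0.5):
    """Tricube-weighted local linear fit of m on a, evaluated at every a."""
    k = min(len(a), max(3, round(frac * len(a))))
    fitted = []
    for a0 in a:
        # Bandwidth: distance from a0 to its k-th nearest point.
        h = max(sorted(abs(x - a0) for x in a)[k - 1], 1e-12)
        w = [tricube((x - a0) / h) for x in a]
        sw = sum(w)
        swx = sum(wi * x for wi, x in zip(w, a))
        swy = sum(wi * y for wi, y in zip(w, m))
        swxx = sum(wi * x * x for wi, x in zip(w, a))
        swxy = sum(wi * x * y for wi, x, y in zip(w, a, m))
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)  # degenerate neighbourhood: weighted mean
        else:
            slope = (sw * swxy - swx * swy) / denom
            fitted.append((swy - slope * swx) / sw + slope * a0)
    return fitted

def intensity_normalize(log_r, log_g):
    """Return M values after subtracting the intensity-dependent trend."""
    a = [(r + g) / 2 for r, g in zip(log_r, log_g)]
    m = [r - g for r, g in zip(log_r, log_g)]
    return [mi - fi for mi, fi in zip(m, local_fit(a, m))]
```

The normalized log R and log G can then be recomputed from A and the corrected M as A + M/2 and A - M/2.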
Preprocessing: normalization
Intensity dependent normalization
Preprocessing: normalization
Result of the normalization
A. Before normalization
[Figure: box plots (minvalue, Q1, Q3, maxvalue) per slide, Series 1 to 4]
B. After normalization
[Figure: box plots (minvalue, Q1, Q3, maxvalue) of RATIO1_NORM, RATIO2_NORM and RATIO3_NORM]
Preprocessing: normalization
• Compensates for spot effects
• Choice of the reference important
– Intuitive reference (e.g. first time point, uninduced sample): intuitive interpretation possible, but ratio often undefined
– Independent reference (reference design, e.g. tissue mixture): ratio defined, but interpretation complicated
Preprocessing: ratio
• ratio (R/G):
– R>G: upregulation, range 1…+infinity
– R<G: downregulation, range 0...1: range of downregulation squashed
• Log ratio:
– upregulation range 0…+infinity
– downregulation range 0…-infinity
• 2 fold overexpression: Ratio = 2, Log2(Ratio) = 1
• 2 fold underexpression: Ratio = 0.5, Log2(Ratio) = -1
Preprocessing: ratio
Overview further analysis
[Figure: raw data => preprocessing => preprocessed data; preprocessed data => test statistic => differentially expressed genes; preprocessed data => clustering => clusters of coexpressed genes]
I. MAIN EFFECTS + EFFECT OF INTEREST
$y_{ijnmk} = \mu + G_i + C_j + A_n + D_m + (GC)_{ij} + \varepsilon_{ijnmk}$
• $\mu$: overall mean
• $G_i$: gene effect (constitutive level of gene)
• $C_j$: condition effect (mRNA isolation efficiency)
• $A_n$: array effect (hybridisation efficiency)
• $D_m$: dye effect (labeling efficiency)
• $(GC)_{ij}$: GC effect (differential expression due to the altered variety)
Model the expression level of each gene as a combination of the different factors
Least squares fit:
• subject to restrictions
• contrast of interest: estimate (GC)i1 – (GC)i2
MultiFactor, Linear, fixed levels
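For a balanced design, the least squares estimates under sum-to-zero restrictions reduce to differences of means. The sketch below does this for a reduced model that keeps only the overall mean, the gene effect, the condition effect and the GC interaction; leaving out the array and dye terms is a simplification made here for brevity, and all names and toy values are hypothetical.

```python
from statistics import mean

def fit_gene_condition(y: dict):
    """y[gene][condition] -> list of replicate log intensities (balanced design)."""
    genes = list(y)
    conds = list(y[genes[0]])
    cell = {(g, c): mean(y[g][c]) for g in genes for c in conds}
    mu = mean(list(cell.values()))
    G = {g: mean([cell[g, c] for c in conds]) - mu for g in genes}
    C = {c: mean([cell[g, c] for g in genes]) - mu for c in conds}
    GC = {(g, c): cell[g, c] - mu - G[g] - C[c] for g in genes for c in conds}
    return mu, G, C, GC

# Toy data (made up): g1 differs between conditions, g2 does not.
data = {"g1": {"c1": [3.0, 3.0], "c2": [1.0, 1.0]},
        "g2": {"c1": [2.0, 2.0], "c2": [2.0, 2.0]}}
mu, G, C, GC = fit_gene_condition(data)
contrast_g1 = GC["g1", "c1"] - GC["g1", "c2"]  # contrast of interest for g1
```

Note that the contrast is measured relative to the average condition effect over all genes, which is why g2 gets a contrast of the opposite sign in this two-gene toy example.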
Preprocessing: ANOVA
Assumption: independent, additive error $\varepsilon \sim F$, where F is a distribution with mean 0 and variance $\sigma^2$
$y_{ijnmk} = \mu + G_i + C_j + A_n + D_m + (GC)_{ij} + \varepsilon_{ijnmk}$
Plot the residuals (y_estimated - y_measured) against the estimated intensity
Preprocessing: ANOVA
I. MAIN EFFECTS + EFFECT OF INTEREST
Analysis of variance shows the relative contribution of each of the effects
$y_{ijnmk} = \mu + G_i + C_j + A_n + D_m + (GC)_{ij} + \varepsilon_{ijnmk}$
Preprocessing: ANOVA
Advantages:
• Gains more information from fewer observations => derives variation from all measurements made (fewer replicates required, e.g. array effect based on N-1 gene measurements)
• Statistical testing: estimated error can be used for bootstrapping to estimate confidence levels
• No ratios required
Requirements:
• Requires knowledge about experimental effects
• The model used implies that all effects and combinations of effects should be linear
• Bootstrapping: residuals should be normally distributed around zero with constant variance
Preprocessing: ANOVA
$y_{ijnmk} = \mu + G_i + C_j + A_n + D_m + (GC)_{ij} + \varepsilon_{ijnmk}$
Estimate error
Simulate new datasets based on estimated error (3000 times)
Calculate factor of interest (GC effect) for each bootstrapped dataset (recalculate ANOVA)
Calculate CI on (GC1-GC2) of N genes based on 3000 bootstraps
Use this interval to test for significant genes
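$y^{\mathrm{boot}}_{ijnmk} = \hat{y}_{ijnm} + \varepsilon_{ijnmk}$ (fitted value plus a resampled residual)
[Figure: bootstrap distribution of GC1-GC2 relative to 0]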
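The bootstrap steps listed above can be sketched for a single gene measured in two conditions. This is a simplified stand-in: the "model" is just the mean per condition instead of the full ANOVA fit, residuals are pooled and resampled with replacement, and the interval is a percentile interval. Function name, defaults and data are hypothetical.

```python
import random
from statistics import mean

def bootstrap_ci(cond1, cond2, n_boot=3000, level=0.99, seed=0):
    """Percentile confidence interval for mean(cond1) - mean(cond2) by residual bootstrap."""
    rng = random.Random(seed)
    fit1, fit2 = mean(cond1), mean(cond2)
    # Estimated error: residuals around the fitted values, pooled over conditions.
    residuals = [x - fit1 for x in cond1] + [x - fit2 for x in cond2]
    diffs = []
    for _ in range(n_boot):
        boot1 = [fit1 + rng.choice(residuals) for _ in cond1]  # simulated dataset
        boot2 = [fit2 + rng.choice(residuals) for _ in cond2]
        diffs.append(mean(boot1) - mean(boot2))  # recalculated effect of interest
    diffs.sort()
    lower = diffs[int((1 - level) / 2 * n_boot)]
    upper = diffs[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper

lower, upper = bootstrap_ci([5.0, 5.2, 4.8, 5.1], [2.0, 2.1, 1.9, 2.2])
significant = not (lower <= 0.0 <= upper)  # interval excludes 0
```

A gene is called significant when its interval for the contrast does not contain 0.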
ANOVA Bootstrap analysis
Preprocessing: ANOVA
DATA
• Filtered for zero values
• set 1: unnormalised data
MODELS (Kerr et al. 2000, 2001)
• Model 1 (no spot effects)
• Model 2 (spot effects independent)
• Model 3 (spot effects dependent)
MODELS
• GC effects not confounded with the spot effects
• type of model does influence the residual error => does influence the bootstrap interval
More Arrays Simultaneously Preprocessing
Least squares fit:
• subject to restrictions
• contrast of interest: estimate (VG)k1g – (VG)k2g (the variety-by-gene contrast, i.e. the GC contrast of the model)
• Usual confidence intervals based on normal theory not appropriate
Bootstrap analysis of residuals avoids making distributional assumptions about the error
Assumption: independent, additive error $\varepsilon \sim F$, where F is a distribution with mean 0 and variance $\sigma^2$
$y_{ijnmk} = \mu + G_i + C_j + A_n + D_m + (GC)_{ij} + \varepsilon_{ijnmk}$
More Arrays Simultaneously Preprocessing
[Figure: residuals plotted against the estimated intensity ŷ; panels: TEST, ARRAY 1; REFERENCE, ARRAY 1; REFERENCE, ARRAY 2; TEST, ARRAY 2]
More Arrays Simultaneously Preprocessing
Additive error and non linear effects undermine application of ANOVA
Preprocessing
[Figure: the four panels (TEST and REFERENCE, ARRAY 1 and ARRAY 2) plotted against ŷ, labelled Lowess]
More Arrays Simultaneously Preprocessing
$y_{ijnmk} = \mu + G_i + C_j + A_n + D_m + (AG)_{ni} + (RG)_{ki} + (GC)_{ij} + \varepsilon_{ijnmk}$
99 % confidence interval based on 100 genes, 3000 bootstraps
retained 370 genes (62 T-test p value < 0.01)
Bootstrap analysis
ID    Rat_1     Rat_2     Rat_3     Rat_4     p         D_GC_effects
285   -3.31674  -3.20904  -2.08115  -1.62183  0.008818  -2.577397
1076  -1.39327  -2.04573  -1.85822  -2.42609  0.002899  -2.175438
3755  -0.81029  -1.50631  -0.99613  -1.40283  0.005643  -1.245061
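The p column above appears consistent with a one-sample t-test on the four log ratios of each gene (H0: mean log ratio = 0); that reading is an inference from the numbers, not something the slide states. A sketch of the test statistic, with a hypothetical function name:

```python
import math
from statistics import mean, stdev

def one_sample_t(log_ratios):
    """t statistic for H0: mean log ratio = 0 (n - 1 degrees of freedom)."""
    n = len(log_ratios)
    return mean(log_ratios) / (stdev(log_ratios) / math.sqrt(n))

# Gene 285, columns Rat_1 to Rat_4.
t_285 = one_sample_t([-3.31674, -3.20904, -2.08115, -1.62183])
print(round(t_285, 2))  # about -6.11
```

With 3 degrees of freedom the two-sided critical value at the 0.01 level is 5.841, so |t| of about 6.1 agrees with the listed p value of 0.008818 being just below 0.01.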
Preprocessing
Methods tested on pygmee dataset 3750 genes
1. ANOVA 99 % CI
2. ANOVA 95 % CI
3. SAM
4. T-test
5. Fold test
Retained 360 genes
Construct for each gene a binary profile 1 1 1 1 1
Hierarchically cluster genes based on this profile
Comparison methods
Only 8 genes retained by all methods
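The binary profile per gene records, for each method, whether the gene was retained (1) or not (0). A minimal sketch with made-up gene IDs and method results; the hierarchical clustering of the profiles is not shown, and the function names are hypothetical.

```python
def binary_profiles(selected_by_method: dict) -> dict:
    """Map each gene to a tuple of 0/1 flags, one flag per method."""
    methods = list(selected_by_method)
    genes = sorted(set().union(*selected_by_method.values()))
    return {g: tuple(int(g in selected_by_method[m]) for m in methods) for g in genes}

def retained_by_all(profiles: dict) -> list:
    """Genes whose profile is all ones, i.e. retained by every method."""
    return [g for g, flags in profiles.items() if all(flags)]

# Toy selections (made up), one set of gene IDs per method.
selected = {"ANOVA 99 % CI": {"a", "b"}, "SAM": {"a", "c"}, "T-test": {"a", "b", "c"}}
profiles = binary_profiles(selected)
common = retained_by_all(profiles)
```

With five methods, the profile 1 1 1 1 1 shown on the slide corresponds to a gene retained by all of them.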
Comparison methods
• Latin Square (mouse data set)
• Reference: normal mouse
• Condition: pygmee mouse
• Two experiments C=1, C=2 reflect two sample time points
• 2 batches: not all genes of the genome on one array
Array  Condition  Batch  Test  Ref
A 1    C 1        B1     R     G
A 2    C 1        B1     G     R
A 5    C 2        B1     R     G
A 6    C 2        B1     G     R
A 3    C 1        B2     R     G
A 4    C 1        B2     R     G
A 7    C 2        B2     R     G
A 8    C 2        B2     G     R
Experiment Design