Introduction to microarray technology and analysis
Carol Bult, Associate Professor
The Jackson [email protected]
Idea: measure the amount of mRNA to see which genes are being expressed in (used by) the cell. Measuring protein might be more direct, but is currently harder.
Measuring Gene Expression
Central Assumption of Gene Expression Microarrays
The level of a given mRNA is positively correlated with the expression of the associated protein. Higher mRNA levels mean higher protein expression; lower mRNA levels mean lower protein expression.
Other factors: protein degradation, mRNA degradation, polyadenylation, codon preference, translation rates, alternative splicing, translation lag…
Principal Uses of Microarrays
• Genome-scale gene expression analysis
• Differential gene expression between two (or more) sample types
• Responses to environmental factors
• Disease processes (e.g. cancer)
• Effects of drugs
• Identification of genes associated with clinical outcomes (e.g. survival)
Microarray example: Biomarker identification - lung cancer
[Figure: expression data with axes labeled Samples and Genes]
Garber, Troyanskaya et al. Diversity of gene expression in adenocarcinoma of the lung. PNAS 2001, 98(24):13784-9.
[Figure: cumulative survival vs. time (months) for Groups 1, 2 and 3; p = 0.002 for Group 1 vs. Group 3]
Data partitioning clinically important: Patient survival for lung cancer subgroups
Garber, Troyanskaya et al. Diversity of gene expression in adenocarcinoma of the lung. PNAS 2001, 98(24):13784-9.
Biological question (differentially expressed genes, sample class prediction, etc.)
-> Experimental design
-> Microarray experiment
-> Image analysis
-> Normalization
-> Estimation, Testing, Clustering, Discrimination
-> Biological verification and interpretation
Technology basics
• Microarrays are composed of short, specific DNA sequences attached to a glass or silicon slide at high density
• A microarray works by exploiting the ability of an mRNA molecule to bind specifically to (hybridize with) the DNA template from which it originated
• RNA or DNA from the sample of interest is fluorescently labeled so that relative or absolute abundances can be quantitatively measured
Two color vs single color
Bakel and Holstege. 2007. http://www.cell-press.com/misc/page?page=ETBR
Other applications of microarray technology (besides measuring gene expression)
• DNA copy number analysis
• SNP analysis
• ChIP-chip (interaction data)
• Competitive growth assays
• …
Major technologies
• cDNA probes (> 200 nt), usually produced by PCR, attached to either nylon or glass supports
• Oligonucleotides (25-80 nt) attached to glass support
• Oligonucleotides (25-30 nt) synthesized in situ on silica wafers (Affymetrix)
• Probes attached to tagged beads
cDNA Microarray Design
Probe selection
• Non-redundant set of probes
• Includes genes of interest to project
• Corresponds to physically available clones
Chip layout
• Grouping of probes by function
• Correspondence between wells in microtiter plates and spots on the chip
Building the chip
Ngai Lab arrayer, UC Berkeley
Print-tip head
http://transcriptome.ens.fr/sgdb/presentation/principle.php
Example dual channel cDNA array results
Affymetrix GeneChips
Probes are oligos synthesized in situ using a photolithographic approach
There are at least 5 oligos per cDNA, plus an equal number of negative controls
The apparatus requires a fluidics station for hybridization and a special scanner
Only a single fluorochrome is used per hybridization
http://genome.ucsc.edu/cgi-bin/hgTracks
There may be 5,000-100,000 probe sets per chip
A probe set = 11-20 PM, MM pairs
Affy
http://www.weizmann.ac.il/home/ligivol/pictures/system.jpg
Interpreting Affymetrix Output
Perfect Match/Mismatch Strategy
For each probe designed to be perfectly complementary to a target sequence, a partner probe is generated that is identical except for a single base mismatch in its center.
These probe pairs, called the Perfect Match probe (PM) and the Mismatch probe (MM), allow the quantitation and subtraction of signals caused by non-specific cross-hybridization.
The difference in hybridization signals between the partners serves as an indicator of specific target abundance
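The PM/MM idea can be illustrated with a minimal sketch that averages the PM minus MM differences over one probe set. This plain average is an assumption chosen for illustration, not the vendor's actual summarization algorithm, and the intensities below are hypothetical.

```python
from statistics import mean

def average_difference(pm, mm):
    """Mean of (PM - MM) over the probe pairs of one probe set."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and the same length")
    return mean(p - m for p, m in zip(pm, mm))

# Hypothetical intensities for one probe set with 4 probe pairs.
pm = [1200.0, 950.0, 1100.0, 1300.0]
mm = [300.0, 250.0, 400.0, 350.0]
signal = average_difference(pm, mm)
print(signal)
```

A probe set whose MM signals are close to its PM signals yields a value near zero, which suggests that most of the PM signal is non-specific.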
Experimental Design
Bakel and Holstege. 2007. http://www.cell-press.com/misc/page?page=ETBR
Microarray Analysis: Controlling for the Known Knowns and Unknown Unknowns
- Donald Rumsfeld, former Secretary of Defense
http://www.bioconductor.org/workshops/2003/NGFN03/experimental-design.pdf
Selected references
http://discover.nci.nih.gov/microarrayAnalysis/Experimental.Design.jsp
Best advice? Consult a statistician before you start!
Statistical Power
The probability that a test will reject a null hypothesis if it is false
Type I and Type II errors
Type I: rejecting a null hypothesis that is actually true. We say there is a difference in gene expression between gene A and gene B when there really isn’t
Type II: failing to reject a null hypothesis that is actually false. We say there is no difference in gene expression between gene A and gene B when there actually is!
Power in Perspective
What are the 4 main components that determine what conclusions are drawn from a study?
• Sample size: number of units
• Effect size: signal to noise
• Alpha level: significance level
• Power: likelihood of detecting a treatment effect if it is there
Check out this pithy description of Statistical Power and Hypothesis Testing: http://www.socialresearchmethods.net/kb/power.php
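How sample size, effect size, alpha level and power depend on each other can be seen in a short sketch. It assumes a two-sided, two-sample z-test with equal group sizes and a standardized effect size (difference in means divided by the standard deviation). That is a simplification: real microarray studies typically use t-type statistics and must correct for multiple testing. The function name and parameters are illustrative only.

```python
import math
from statistics import NormalDist

def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test.

    effect_size is the difference in means in units of the standard deviation.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)               # rejection threshold
    shift = effect_size * math.sqrt(n_per_group / 2)
    # Probability that the test statistic falls in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

for n in (3, 5, 10, 20):
    print(n, round(two_sample_power(1.0, n), 3))
```

With no true effect the function returns alpha (the Type I error rate); for a fixed effect size, power rises with the number of replicates.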
MicroArray Image Analysis
Based on slides from Robin Liechti ([email protected])
Microarray analysis
Array construction, hybridisation, scanning
Quantitation of fluorescence signals
Data visualisation
Meta-analysis (clustering)
More visualisation
Technical
probe (on chip)
sample (labelled)
pseudo-colour image
[image from Jeremy Buhler]
Experimental design
• Track what’s on the chip: which spot corresponds to which gene
• Duplicate experimental spots: reproducibility
• Controls
  DNAs spotted on glass: positive probe (induced or repressed); negative probe (bacterial genes on human chip)
  Oligos on glass or synthesised on chip (Affymetrix): point mutants (hybridisation plus/minus)
Images from scanner
Resolution
• standard 10 µm [currently, max 5 µm]
• a 100 µm spot on the chip = 10 pixels in diameter
Image format
• TIFF (tagged image file format), 16 bit (65’536 levels of grey)
• 1 cm x 1 cm image at 16 bit = 2 MB (uncompressed)
• other formats exist, e.g. SCN (used at Stanford University)
Separate image for each fluorescent sample
• channel 1, channel 2, etc.
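The image-size figures above follow from simple arithmetic, sketched here. The function name and arguments are illustrative; the default pixel size of 10 µm is the standard resolution quoted on the slide.

```python
def scan_size(width_um, height_um, pixel_um=10, bits_per_pixel=16):
    """Return (number of pixels, uncompressed size in bytes) for one channel."""
    cols = width_um // pixel_um
    rows = height_um // pixel_um
    n_pixels = cols * rows
    n_bytes = n_pixels * bits_per_pixel // 8
    return n_pixels, n_bytes

pixels, size_bytes = scan_size(10_000, 10_000)   # 1 cm x 1 cm scan area
spot_diameter_px = 100 // 10                     # 100 µm spot at 10 µm per pixel
print(pixels, size_bytes, spot_diameter_px)
```

One channel of a 1 cm x 1 cm scan is therefore about 2 MB, and a two-color experiment doubles that.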
Images in analysis software
The two 16-bit images (cy3, cy5) are compressed into 8-bit images
Goal : display fluorescence intensities for both wavelengths using a 24-bit RGB overlay image
RGB image:
• Blue values (B) are set to 0
• Red values (R) are used for cy5 intensities
• Green values (G) are used for cy3 intensities
Qualitative representation of results
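A minimal sketch of the overlay construction described above. The slides do not say how the 16-bit values are compressed to 8 bits, so the linear mapping used here (dropping the low 8 bits) is an assumption; analysis software may use a different mapping. Images are represented as nested lists to keep the example self-contained.

```python
def to_8bit(value16):
    """Linear 16-bit to 8-bit compression (assumed mapping)."""
    return value16 >> 8

def overlay(cy3, cy5):
    """Build an RGB image (rows of (R, G, B) tuples) from two 16-bit channels.

    Red carries cy5, green carries cy3, blue is set to 0.
    """
    return [
        [(to_8bit(r), to_8bit(g), 0) for g, r in zip(row3, row5)]
        for row3, row5 in zip(cy3, cy5)
    ]

# Hypothetical 2 x 2 channel images.
cy3 = [[65535, 0], [65535, 256]]
cy5 = [[65535, 65535], [0, 256]]
rgb = overlay(cy3, cy5)
print(rgb)
```

Equal strong signal in both channels gives yellow, cy5 only gives red, and cy3 only gives green, which is why the overlay is a qualitative view rather than a measurement.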
Images: examples
[Figure: cy3 image, cy5 image and pseudo-color overlay]
Spot color   Signal strength       Gene expression
yellow       Control = perturbed   unchanged
red          Control < perturbed   induced
green        Control > perturbed   repressed
Processing of images
• Addressing or gridding: assigning coordinates to each of the spots
• Segmentation: classification of pixels as either foreground or background
• Intensity extraction (for each spot): foreground fluorescence intensity pairs (R, G), background intensities, quality measures
ScanAlyze
Parameters to address the spot positions
• Separation between rows and columns of grids
• Individual translation of grids
• Separation between rows and columns of spots within each grid
• Small individual translation of spots
• Overall position of the array in the image
Addressing (I)
The basic structure of the images is known (determined by the arrayer)
Addressing (II)
The measurement process depends on the addressing procedure
Addressing efficiency can be enhanced by allowing user intervention (slow!)
Most software systems now provide for both manual and automatic gridding procedures
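Because the layout is fixed by the arrayer, nominal spot centers can be generated from a handful of parameters. The sketch below assumes a perfectly regular layout and leaves out the individual translations of grids and spots that real software also fits; all names and numbers are hypothetical.

```python
def spot_centers(origin, grid_rows, grid_cols, grid_pitch,
                 spot_rows, spot_cols, spot_pitch):
    """Return {(grid_row, grid_col, spot_row, spot_col): (x, y)} in pixels.

    origin is the center of the first spot of the first grid;
    grid_pitch and spot_pitch are (x, y) separations.
    """
    x0, y0 = origin
    centers = {}
    for gr in range(grid_rows):
        for gc in range(grid_cols):
            gx = x0 + gc * grid_pitch[0]
            gy = y0 + gr * grid_pitch[1]
            for sr in range(spot_rows):
                for sc in range(spot_cols):
                    centers[(gr, gc, sr, sc)] = (gx + sc * spot_pitch[0],
                                                 gy + sr * spot_pitch[1])
    return centers

# Hypothetical layout: 2 x 2 grids, each with 3 rows x 4 columns of spots.
centers = spot_centers((50, 40), 2, 2, (400, 420), 3, 4, (15, 15))
print(len(centers), centers[(1, 1, 2, 3)])
```

Manual gridding amounts to adjusting these parameters by eye; automatic gridding estimates them from the image.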
Segmentation (I)
Classification of pixels as foreground or background -> fluorescence intensities are calculated for each spot as a measure of transcript abundance
Production of a spot mask : set of foreground pixels for each spot
Segmentation (II)
Segmentation methods:
• Fixed circle segmentation: ScanAlyze, GenePix, QuantArray
• Adaptive circle segmentation: GenePix, Dapple
• Adaptive shape segmentation: Spot, region growing and watershed
• Histogram segmentation: ImaGene, QuantArray, DeArray and adaptive thresholding
Fixed circle segmentation
• Fits a circle with a constant diameter to all spots in the image
• Easy to implement
• The spots need to be of the same shape and size
Bad example !
Adaptive circle segmentation
• The circle diameter is estimated separately for each spot
• Dapple finds spots by detecting edges of spots (second derivative)
• Problematic if spots are oval
Adaptive shape segmentation
• Specification of starting points or seeds
• Regions grow outwards from the seed points preferentially according to the difference between a pixel’s value and the running mean of values in an adjoining region.
Histogram segmentation
• Uses a target mask chosen to be larger than any spot
• Foreground and background intensity are determined from the histogram of pixel values for pixels within the masked area
• Example: QuantArray
  Background: mean between 5th and 20th percentile
  Foreground: mean between 80th and 95th percentile
• Unstable when a large target mask is set to compensate for variation in spot size
[Figure: histogram of pixel values with background (Bkgd) and foreground regions marked]
Information extraction
Spot intensity
The total amount of hybridization for a spot is proportional to the total fluorescence at the spot
Spot intensity = sum of pixel intensities within the spot mask
Since later calculations are based on ratios between cy5 and cy3, we compute the average* pixel value over the spot mask
*alternative : use ratios of medians instead of means
Background intensity
Motivation : spot’s measured intensity includes a contribution of non-specific hybridization and other chemicals on the glass
Fluorescence from regions not occupied by DNA should be different from regions occupied by DNA -> could be interesting to use local negative controls (spotted DNA that should not hybridize)
Different background methods: local background, morphological opening, constant background, no adjustment
Local background
• Focuses on small regions surrounding the spot mask; the background is the median of pixel values in this region
• Most software packages implement such an approach: ScanAlyze, ImaGene, Spot, GenePix
• By not considering the pixels immediately surrounding the spots, the background estimate is less sensitive to the performance of the segmentation procedure
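A local background estimate can be sketched as the median over a ring of pixels around the spot center. The ring shape is an assumption made for illustration (packages differ in the region they use); choosing an inner radius larger than the spot skips the pixels immediately surrounding it.

```python
from statistics import median

def local_background(image, center, inner_radius, outer_radius):
    """Median of pixels whose distance d from center satisfies inner < d <= outer."""
    cx, cy = center
    ring = []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            d2 = (x - cx) ** 2 + (y - cy) ** 2
            if inner_radius ** 2 < d2 <= outer_radius ** 2:
                ring.append(value)
    return median(ring)

# Synthetic 9 x 9 image: background 100, bright spot of radius 2 in the middle.
image = [[1000 if (x - 4) ** 2 + (y - 4) ** 2 <= 4 else 100 for x in range(9)]
         for y in range(9)]
bkgd = local_background(image, (4, 4), inner_radius=3, outer_radius=4)
print(bkgd)
```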
Morphological opening (Spot)
• Applied to the original images R and G
• Uses a square structuring element with side length at least twice as large as the spot separation distance
• Removes all the spots and generates an image that is an estimate of the background for the entire slide
• For individual spots, the background is estimated by sampling this background image at the nominal center of the spot
• Gives a lower and less variable background estimate
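Morphological opening is an erosion (local minimum) followed by a dilation (local maximum) with the same structuring element. The sketch below uses a square window clipped at the image border, which is an assumption about edge handling, and a tiny synthetic image; it is not the Spot implementation.

```python
def _window_filter(image, half, reduce_fn):
    """Apply reduce_fn (min or max) over a (2*half+1)-sided square window."""
    rows, cols = len(image), len(image[0])
    out = []
    for y in range(rows):
        out_row = []
        for x in range(cols):
            window = [
                image[yy][xx]
                for yy in range(max(0, y - half), min(rows, y + half + 1))
                for xx in range(max(0, x - half), min(cols, x + half + 1))
            ]
            out_row.append(reduce_fn(window))
        out.append(out_row)
    return out

def morphological_opening(image, half):
    """Erosion followed by dilation; removes bright features smaller than the window."""
    eroded = _window_filter(image, half, min)
    return _window_filter(eroded, half, max)

# Synthetic 9 x 9 image: background 100 with a 3 x 3 spot of 1000 in the middle.
image = [[1000 if 3 <= x <= 5 and 3 <= y <= 5 else 100 for x in range(9)]
         for y in range(9)]
background = morphological_opening(image, half=2)   # 5 x 5 window, larger than the spot
print(background[4][4])   # background estimate sampled at the spot center
```

Because the window is larger than the spot, the spot disappears from the opened image and only the background level remains at its center.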
Constant background
Global method which subtracts a constant background for all spots
Some findings suggest that the binding of fluorescent dyes to ‘negative control spots’ is lower than the binding to the glass slide
-> More meaningful to estimate background based on a set of negative control spots
If no negative control spots: approximation of the average background = third percentile of all the spot foreground values
No adjustment
Do not consider the background
Quality measures (-> Flag)
How good are foreground and background measurements?
• Variability measures in pixel values within each spot mask
• Spot size
• Circularity measure
• Relative signal to background intensity
• b-value: fraction of background intensities less than the median foreground intensity
• p-score: extent to which the position of a spot deviates from a rigid rectangular grid
Based on these measurements, one can flag a spot
Summary
• The choice of background correction method has a larger impact on the log-intensity ratios than the segmentation method used
• The morphological opening method provides a better estimate of background than other methods: low within- and between-slide variability of the log2 R/G
• Background adjustment has a larger impact on low intensity spots
[Figure: M vs. A plots; panels: Spot, GenePix, ScanAlyze]
M = log2 R/G
A = log2 √(R•G)
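The M and A definitions above translate directly into code. This is a minimal sketch for a single spot; R and G are assumed to be positive, background-corrected intensities, and the example values are hypothetical.

```python
import math

def ma_values(red, green):
    """Return (M, A) with M = log2(R/G) and A = log2(sqrt(R*G))."""
    m = math.log2(red / green)
    a = 0.5 * math.log2(red * green)
    return m, a

m, a = ma_values(4096, 1024)
print(m, a)
```

M measures the fold change between channels on a log scale, while A measures overall spot brightness, so plotting M against A shows whether the ratio depends on intensity.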
Selected references
Yang, Y. H., Buckley, M. J., Dudoit, S. and Speed, T. P. (2001), ‘Comparisons of methods for image analysis on cDNA microarray data’. Technical report #584, Department of Statistics, University of California, Berkeley. http://www.stat.berkeley.edu/users/terry/zarray/Html/papersindex.html
Yang, Y. H., Buckley, M. J. and Speed, T. P. (2001), ‘Analysis of cDNA microarray images’. Briefings in Bioinformatics, 2 (4), 341-349. Excellent review in concise format!
http://pbil.univ-lyon1.fr/library/limma/doc/usersguide.html
Download the limma package and work through the Swirl zebrafish example.
Normalization - two problems
I. How do we detect biases? Which genes should we use for estimating biases among chips/channels?
II. How do we remove the biases?
Why normalize?
• Microarray data have significant systematic variation, both within arrays and between arrays, that is not true biological variation
• Accurate comparison of genes’ relative expression within and across conditions requires normalization of effects
• Sources of variation: spatial location on the array; dye biases which vary with spot intensity; plate origin; printing/spotting quality; experimenter
Why is normalization important?
Experiment: Comparison of gene expression in mouse heart and kidney in response to drug
Source: http://www.partek.com
Most biological effects are swamped by systematic effects!
Other Sources of Systematic Bias
Individual Factors
• Print (20% - 30%)
• Experimenter (20% - 30%)
• Organism (3% - 10%)
• Date (5%)
• Software (2%)
• Number of tips (3%)
Interactions
• Print - Experimenter (40%)
• Print - Date (40%)
• Experimenter - Date (40%)
(slide from Catherine Ball)
(based on ~4,600 experiments in Stanford Microarray Database analyzed by ANOVA)
KO #8
Probes: ~6,000 cDNAs, including 200 related to lipid metabolism. Arranged in a 4x4 array of 19x21 sub-arrays.
Clearly visible plate effects
Spatial Biases
Solution: spatial background estimation/subtraction
Spatial plots: background from two slides
Highlighting extreme log ratios
Top (black) and bottom (green) 5% of log ratios
Pin group (sub-array) effects
Boxplots of log ratios by pin group
Lowess lines through points from pin groups
Boxplots and highlighting pin group effects
Clear example of spatial bias
[Figure: log-ratios by print-tip group]
Time of printing effects
Green channel intensities (log2G). Printing over 4.5 days. The previous slide depicts a slide from this print run.
spot number
Normalization in a nutshell
• Goal is to measure the ratios of gene expression levels: (ratio)_i = R_i/G_i, where R_i and G_i are, respectively, the measured red and green intensities for the ith spot
• In a self-hybridization, we would expect all ratios to be equal to one: R_i/G_i = 1 for all i. But they probably won’t be…
• Why? Noise (systematic bias) and signal (true differences)
• Normalization brings appropriate ratios closer to 1
[Figure: ratio histogram; x-axis: Ratio (0 to 2), y-axis: Frequency]
The Starting Point: The Ratio (2-color arrays)
[Figure: Log(ratio) histogram; x-axis: Log(ratio) (-2 to 2), y-axis: Frequency]
Log ratios treat up- and down-regulated genes equally (two-color arrays)
log2(1) = 0, log2(2) = 1, log2(1/2) = -1
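The symmetry of log ratios can be checked in a few lines. The labels are illustrative only.

```python
import math

# Raw ratios: a 2-fold increase and a 2-fold decrease are not symmetric around 1.
ratios = {"2-fold up": 2.0, "unchanged": 1.0, "2-fold down": 0.5}

# After the log2 transform they are symmetric around 0.
log_ratios = {label: math.log2(r) for label, r in ratios.items()}
print(log_ratios)
```

On the raw scale all down-regulated genes are squeezed between 0 and 1 while up-regulated genes extend from 1 upward; the log transform removes that asymmetry.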
A note about Affymetrix (1-color) pre-processing
Log transform
Typical Affymetrix probe intensity distribution
After log-transform
Normalization methods
Which Genes to use for bias detection?
1. All genes on the chip
• Assumption: most of the genes are equally expressed in the compared samples; the proportion of differentially expressed genes is low (<20%).
• Limits: not appropriate when comparing highly heterogeneous samples (different tissues); not appropriate for analysis of ‘dedicated chips’ (apoptosis chips, inflammation chips, etc.)
Which Genes to use for bias detection?
2. Housekeeping genes
• Assumption: based on prior knowledge, a set of genes can be regarded as equally expressed in the compared samples
• Affy novel chips: ‘normalization set’ of 100 genes
• NHGRI’s cDNA microarrays: 70 "house-keeping" genes set
• Limits: the validity of the assumption is questionable; housekeeping genes are usually expressed at high levels, so they are not informative for the low intensity range
Which Genes to use for bias detection?
3. Spiked-in controls from another organism, over a range of concentrations
• Limits: low number of controls (less robust); can’t detect biases due to differences in RNA extraction protocols
4. “Invariant set”
• Trying to identify genes that are expressed at similar levels in the compared samples without relying on any prior knowledge:
  Rank the genes in each chip according to their expression level
  Find genes with small change in ranks
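The two steps above (rank, then keep genes whose rank barely changes) can be sketched as a single pass. This is a simplification for illustration: published invariant-set procedures are typically more elaborate, and the threshold `max_rank_change` and the intensities are hypothetical.

```python
def ranks(values):
    """Rank of each value (0 = lowest); ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    rank = [0] * len(values)
    for r, i in enumerate(order):
        rank[i] = r
    return rank

def invariant_set(chip_a, chip_b, max_rank_change):
    """Indices of genes whose rank differs by at most max_rank_change between chips."""
    ra, rb = ranks(chip_a), ranks(chip_b)
    return [i for i in range(len(chip_a)) if abs(ra[i] - rb[i]) <= max_rank_change]

# Hypothetical intensities for 5 genes on two chips.
chip_a = [10, 200, 50, 800, 400]
chip_b = [12, 900, 60, 850, 100]
stable = invariant_set(chip_a, chip_b, max_rank_change=1)
print(stable)
```

Gene 1 moves from the middle of the ranking on the first chip to the top on the second, so it is excluded; the remaining genes would be used to estimate the bias.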
1. Global normalization (Scaling)
• A single normalization factor (k) is computed for balancing chips/channels: X_i^norm = k * X_i, or log2 R/G -> log2 R/G – c (2-color)
• Multiplying intensities by this factor equalizes the mean (median) intensity among compared chips
• Assumption: total RNA (mass) used is the same for both samples. So, averaged across thousands of genes, total hybridization should be the same for both samples.
Global Normalization (1-color, e.g. Affymetrix)
[Figure: intensity distributions before and after scaling]
X_i^norm = k * X_i
Global Normalization (2-color)
[Figure: histograms of log-ratios (x-axis) vs. frequency (y-axis); panels: un-normalized and normalized]
log2 R/G -> log2 R/G – c, where c = log2 (∑R_i / ∑G_i)
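A minimal sketch of global normalization for a two-color array, using the constant c defined above. The intensities are hypothetical and assumed to be positive.

```python
import math

def global_normalize(red, green):
    """Return log2(R/G) - c for each spot, with c = log2(sum(R) / sum(G))."""
    c = math.log2(sum(red) / sum(green))
    return [math.log2(r / g) - c for r, g in zip(red, green)]

# Hypothetical array where the red channel is uniformly twice as bright.
red = [200.0, 400.0, 800.0]
green = [100.0, 200.0, 400.0]
normalized = global_normalize(red, green)
print(normalized)
```

Subtracting c on the log scale is the same as multiplying every red intensity by k = ∑G_i / ∑R_i, so a uniform dye imbalance is removed while spot-to-spot differences are left untouched.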
2. Intensity-dependent normalization (Yang, Speed)
(Lowess – local linear fit)
Compensate for intensity-dependent biases
Detect intensity-dependent biases: M vs A plots (also called R-I plots)
X axis: A – average intensity, A = 0.5*log(Cy3*Cy5)
Y axis: M – log ratio, M = log(Cy3/Cy5)
Intensity-dependent bias
[Figure: M vs. A plot, M = log(Cy3/Cy5)]
Low intensities: M < 0 (Cy3 < Cy5)
High intensities: M > 0 (Cy3 > Cy5)
* Global normalization cannot remove intensity-dependent biases
We expect the M vs A plot to look like:
[Figure: M = log(Cy3/Cy5) plotted against A]
LOWESS (Locally Weighted Scatterplot Smoothing)
• Local linear regression model
• Tri-cube weight function
• Least Squares
Estimated values of log2(Cy5/Cy3) as function of log10(Cy3*Cy5)
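The three ingredients listed above (local linear regression, tri-cube weights, least squares) can be combined in a compact sketch. It leaves out the robustness iterations of the full LOWESS procedure, and the span parameter `frac` and the test data are assumptions made for illustration. In practice one would use an established implementation.

```python
def lowess_fit(x, y, frac=0.5):
    """Fitted value at each x from a tri-cube weighted local linear regression."""
    n = len(x)
    k = max(2, int(round(frac * n)))          # points contributing to each local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12             # bandwidth: distance to the k-th nearest point
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0  # weighted least-squares slope
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Hypothetical M vs A data with a purely intensity-dependent trend.
a_values = [float(i) for i in range(6, 16)]
m_values = [0.2 * a - 2.0 for a in a_values]
trend = lowess_fit(a_values, m_values)
m_normalized = [m - t for m, t in zip(m_values, trend)]
print([round(v, 6) for v in m_normalized])
```

Normalization subtracts the fitted curve from each log ratio, so spots that follow the intensity-dependent trend end up near M = 0.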
A note about Affymetrix (1-color) pre-processing
Two “standard” methods
• MAS 5.0 (now GCOS/GDAS) by Affymetrix (compares PM and MM probes)
• RMA by Speed group (UC Berkeley) (ignores MM probes)
[Diagram labels: within-chip, cross-chip; sequence-specific background correction; within-probe-set aggregation of intensity values]
Normalization – Thoughts
There are many different ways to normalize data
• Global median, LOWESS, LOESS, etc.
• By print tip, spatial, etc.
Choose one wisely
BUT: don’t expect it to fix bad data!
• Won’t make up for lack of replicates
• Won’t make up for horrible slides
For next time…
• Read Quackenbush paper on normalization
• Look up the paper on Robust Multichip Averaging (RMA) out of Terry Speed’s lab
• What is meant by least squares?
• Visit the Gene Expression Omnibus (GEO) resource at NCBI and explore what is there
• If you aren’t familiar with the statistical computing environment, R, look it up on the web
• Look up MeV (multi-experiment viewer) on the web.
Analysis
Microarray Data Flow
• Microarray experiment
• Image Analysis
• Database
• Data Selection & Missing value estimation
• Data Matrix
• Unsupervised Analysis – clustering
• Networks & Data Integration
• Supervised Analysis
• Normalization & Centering
• Decomposition techniques
Interpretation
Microarray data on the Web
Several initiatives to create “unified” databases
• EBI: ArrayExpress
• NCBI: Gene Expression Omnibus
Normalization - tools
Bioconductor (both Affymetrix and cDNA): Packages in R language
dChip (Affymetrix): Quantile, Invariant set
MAANOVA Microarray ANOVA analysis
Normalization is typically provided in microarray vendor’s software/core facilities but you should always understand the data you’re working with
How has your data been processed? Are there any lingering effects?