Introduction to microarray technology and analysis

Carol Bult, Associate Professor

The Jackson Laboratory

Idea: measure the amount of mRNA to see which genes are being expressed in (used by) the cell. Measuring protein might be more direct, but is currently harder.

Measuring Gene Expression

Central Assumption of Gene Expression Microarrays

The level of a given mRNA is positively correlated with the expression of the associated protein. Higher mRNA levels mean higher protein expression; lower mRNA levels mean lower protein expression.

Other factors: protein degradation, mRNA degradation, polyadenylation, codon preference, translation rates, alternative splicing, translation lag…

Principal Uses of Microarrays

Genome-scale gene expression analysis
Differential gene expression between two (or more) sample types
Responses to environmental factors
Disease processes (e.g. cancer)
Effects of drugs
Identification of genes associated with clinical outcomes (e.g. survival)

Microarray example: Biomarker identification - lung cancer

[Figure: expression data matrix with axes Genes and Samples]

Garber, Troyanskaya et al. Diversity of gene expression in adenocarcinoma of the lung. PNAS 2001, 98(24):13784-9.

[Figure: cumulative survival (Cum. Survival, 0 to 1) versus time (months, 0 to 60), including the curve for Group 3]

p = 0.002 for Gr. 1 vs. Gr. 3

Data partitioning clinically important: Patient survival for lung cancer subgroups

Garber, Troyanskaya et al. Diversity of gene expression in adenocarcinoma of the lung. PNAS 2001, 98(24):13784-9.

[Diagram: Biological question (differentially expressed genes, sample class prediction, etc.) → Experimental design → Microarray experiment → Image analysis → Normalization → Estimation, Testing, Clustering, Discrimination → Biological verification and interpretation]

Technology basics

Microarrays are composed of short, specific DNA sequences attached to a glass or silicon slide at high density

A microarray works by exploiting the ability of an mRNA molecule to bind specifically to (hybridize to) the DNA template from which it originated

RNA or DNA from the sample of interest is fluorescently labeled so that relative or absolute abundances can be quantitatively measured

Two color vs single color

Bakel and Holstege. 2007. http://www.cell-press.com/misc/page?page=ETBR

Other applications of microarray technology (besides measuring gene expression)

DNA copy number analysis
SNP analysis
ChIP-chip (interaction data)
Competitive growth assays
…

Major technologies

cDNA probes (> 200 nt), usually produced by PCR, attached to either nylon or glass supports

Oligonucleotides (25-80 nt) attached to glass support

Oligonucleotides (25-30 nt) synthesized in situ on silica wafers (Affymetrix)

Probes attached to tagged beads

cDNA Microarray Design

Probe selection
  Non-redundant set of probes
  Includes genes of interest to project
  Corresponds to physically available clones

Chip layout
  Grouping of probes by function
  Correspondence between wells in microtiter plates and spots on the chip

Building the chip

Ngai Lab arrayer, UC Berkeley

Print-tip head

http://transcriptome.ens.fr/sgdb/presentation/principle.php

Example dual channel cDNA array results

Affymetrix GeneChips

Probes are oligos synthesized in situ using a photolithographic approach

There are at least 5 oligos per cDNA, plus an equal number of negative controls

The apparatus requires a fluidics station for hybridization and a special scanner

Only a single fluorochrome is used per hybridization

http://genome.ucsc.edu/cgi-bin/hgTracks

There may be 5,000-100,000 probe sets per chip

A probe set = 11-20 PM, MM pairs

Affy

http://www.weizmann.ac.il/home/ligivol/pictures/system.jpg

Interpreting Affymetrix Output: Perfect Match/Mismatch Strategy

Each probe is designed to be perfectly complementary to a target sequence; a partner probe is generated that is identical except for a single base mismatch at its center.

These probe pairs, called the Perfect Match probe (PM) and the Mismatch probe (MM), allow the quantitation and subtraction of signals caused by non-specific cross-hybridization.

The difference in hybridization signals between the partners serves as an indicator of specific target abundance
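The PM/MM idea can be illustrated with a small calculation. The sketch below simply averages the PM minus MM differences over one probe set; it is an illustration of the subtraction principle only, not the algorithm used by any Affymetrix software, and the function name and intensity values are hypothetical.

```python
from statistics import mean

def average_difference(pm, mm):
    """Mean of (PM - MM) over the probe pairs of one probe set."""
    if len(pm) != len(mm):
        raise ValueError("each PM probe needs exactly one MM partner")
    return mean(p - m for p, m in zip(pm, mm))

# Hypothetical intensities for one probe set made of 11 probe pairs.
pm = [1200, 980, 1500, 1100, 1330, 870, 1420, 1010, 1250, 990, 1160]
mm = [300, 260, 410, 280, 350, 240, 390, 270, 310, 250, 290]

signal = average_difference(pm, mm)
print(f"probe set signal (mean PM - MM): {signal:.1f}")
```

A large positive value suggests specific hybridization; a value near zero suggests that most of the PM signal is cross-hybridization.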




Experimental Design

Bakel and Holstege. 2007. http://www.cell-press.com/misc/page?page=ETBR

Microarray Analysis: Controlling for the Known Knowns and Unknown Unknowns

- Donald Rumsfeld, former Secretary of Defense

http://www.bioconductor.org/workshops/2003/NGFN03/experimental-design.pdf

Selected references

http://discover.nci.nih.gov/microarrayAnalysis/Experimental.Design.jsp

Best advice? Consult a statistician before you start!

Statistical Power

The probability that a test will reject the null hypothesis when it is false

Type I and Type II errors

Type I: reject the null hypothesis when it is actually true. We say there is a difference in gene expression between gene A and gene B when there really isn't.

Type II: fail to reject the null hypothesis when it is actually false. We say there is no difference in gene expression between gene A and gene B when there actually is!

Power in Perspective

Sample size: number of units
Effect size: signal to noise
Alpha level: significance level
Power: likelihood of detecting a treatment effect if it is there
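To see how sample size, effect size, alpha, and power depend on each other, here is a minimal sketch using a normal approximation for a two-sided, two-sample comparison. The function name and the example numbers are assumptions made for illustration; a real microarray design would also need to account for multiple testing across thousands of genes.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is the difference in means divided by the common
    standard deviation (signal to noise).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

for n in (5, 10, 20, 64):
    print(n, round(two_sample_power(0.5, n), 3))
```

Fixing any three of the four components determines the fourth, which is why power is usually computed before the experiment to choose the number of replicates.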

What are the 4 main components that determine what conclusions are drawn from a study?

Check out this pithy description of Statistical Power and Hypothesis Testing: http://www.socialresearchmethods.net/kb/power.php

MicroArray Image Analysis

Based on slides from Robin Liechti

Microarray analysis

Array construction, hybridisation, scanning

Quantitation of fluorescence signals

Data visualisation

Meta-analysis (clustering)

More visualisation

[Figure: probe (on chip), sample (labelled), and pseudo-colour image; image from Jeremy Buhler]

Experimental design

Track what's on the chip: which spot corresponds to which gene

Duplicate experimental spots: reproducibility

Controls
  DNAs spotted on glass
    positive probe (induced or repressed)
    negative probe (bacterial genes on human chip)
  oligos on glass or synthesised on chip (Affymetrix)
    point mutants (hybridisation plus/minus)

Images from scanner

Resolution: standard 10 µm [currently, max 5 µm]; a 100 µm spot on chip = 10 pixels in diameter

Image format: TIFF (tagged image file format), 16 bit (65,536 levels of grey); a 1 cm x 1 cm image at 16 bit = 2 MB (uncompressed); other formats exist, e.g. SCN (used at Stanford University)

Separate image for each fluorescent sample: channel 1, channel 2, etc.
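The image-size figures above follow from simple arithmetic, sketched here under the assumption of the standard 10 µm pixel size. The function name is illustrative.

```python
def image_size_bytes(side_um, pixel_um, bits_per_pixel):
    """Uncompressed size of a square scan, in bytes."""
    pixels_per_side = side_um // pixel_um
    return pixels_per_side * pixels_per_side * bits_per_pixel // 8

# 1 cm = 10,000 µm, scanned at 10 µm per pixel, 16 bits per pixel.
size = image_size_bytes(side_um=10_000, pixel_um=10, bits_per_pixel=16)
print(size)  # 1000 x 1000 pixels x 2 bytes

# A 100 µm spot at 10 µm per pixel spans this many pixels in diameter.
spot_pixels = 100 // 10
print(spot_pixels)
```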

Images in analysis software

The two 16-bit images (cy3, cy5) are compressed into 8-bit images

Goal: display fluorescence intensities for both wavelengths using a 24-bit RGB overlay image

RGB image:
  Blue values (B) are set to 0
  Red values (R) are used for cy5 intensities
  Green values (G) are used for cy3 intensities

Qualitative representation of results
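A minimal sketch of the overlay construction follows. The slides do not say how the 16-bit values are compressed to 8 bits, so dropping the low byte is an assumption made here for simplicity; analysis packages may use a different transform. Images are represented as plain nested lists.

```python
def to_8bit(value16):
    """Map a 16-bit intensity (0..65535) to 8 bits by dropping the low byte.

    Assumption: a plain bit shift; real software may scale differently.
    """
    return value16 >> 8

def overlay(cy3, cy5):
    """Build rows of (R, G, B) pixels: R from cy5, G from cy3, B = 0."""
    return [[(to_8bit(r), to_8bit(g), 0) for g, r in zip(row3, row5)]
            for row3, row5 in zip(cy3, cy5)]

# Three hypothetical pixels: equal signal, cy5 only, cy3 only.
cy3 = [[65535, 0, 65535]]
cy5 = [[65535, 65535, 0]]
pixels = overlay(cy3, cy5)
print(pixels)  # yellow, red, green
```

Because 16 bits are squeezed into 8, the overlay is only a qualitative view; quantitation should use the original 16-bit images.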

Images: examples

[Figure panels: cy3, cy5, pseudo-color overlay]

Spot color   Signal strength       Gene expression
yellow       Control = perturbed   unchanged
red          Control < perturbed   induced
green        Control > perturbed   repressed

Processing of images

Addressing or gridding: assigning coordinates to each of the spots

Segmentation: classification of pixels either as foreground or as background

Intensity extraction (for each spot)
  Foreground fluorescence intensity pairs (R, G)
  Background intensities
  Quality measures


Addressing (I)

The basic structure of the images is known (determined by the arrayer)

Parameters to address the spot positions:
  Separation between rows and columns of grids
  Individual translation of grids
  Separation between rows and columns of spots within each grid
  Small individual translation of spots
  Overall position of the array in the image

[Screenshot: ScanAlyze]
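The addressing parameters can be turned into nominal spot coordinates as sketched below. This is a simplified illustration: it uses the overall array position, the grid separation, and the spot separation, and leaves out the small individual translations of grids and spots. The function name and the pixel pitches are hypothetical; the 4x4 layout of 19x21 sub-arrays matches the example array described later in these slides.

```python
def spot_centers(origin, grid_rows, grid_cols, grid_pitch,
                 spot_rows, spot_cols, spot_pitch):
    """Return {(grid_r, grid_c, spot_r, spot_c): (x, y)} nominal centers.

    origin     -- (x, y) of the first spot of the first grid
    grid_pitch -- (dx, dy) separation between neighbouring grids
    spot_pitch -- (dx, dy) separation between spots within a grid
    """
    x0, y0 = origin
    centers = {}
    for gr in range(grid_rows):
        for gc in range(grid_cols):
            for sr in range(spot_rows):
                for sc in range(spot_cols):
                    x = x0 + gc * grid_pitch[0] + sc * spot_pitch[0]
                    y = y0 + gr * grid_pitch[1] + sr * spot_pitch[1]
                    centers[(gr, gc, sr, sc)] = (x, y)
    return centers

# Hypothetical pitches in pixels.
centers = spot_centers(origin=(50, 40), grid_rows=4, grid_cols=4,
                       grid_pitch=(450, 420), spot_rows=19, spot_cols=21,
                       spot_pitch=(20, 20))
print(len(centers), centers[(1, 2, 3, 4)])
```

Automatic gridding then refines these nominal positions against the image, and manual adjustment covers the cases where that refinement fails.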

Addressing (II)

The measurement process depends on the addressing procedure

Addressing efficiency can be enhanced by allowing user intervention (slow!)

Most software systems now provide for both manual and automatic gridding procedures

Segmentation (I)

Classification of pixels as foreground or background -> fluorescence intensities are calculated for each spot as measure of transcript abundance

Production of a spot mask : set of foreground pixels for each spot

Segmentation (II)

Segmentation methods: fixed circle segmentation, adaptive circle segmentation, adaptive shape segmentation, histogram segmentation

Method            Software or algorithm
Fixed circle      ScanAlyze, GenePix, QuantArray
Adaptive circle   GenePix, Dapple
Adaptive shape    Spot, region growing and watershed
Histogram method  ImaGene, QuantArray, DeArray and adaptive thresholding

Fixed circle segmentation

Fits a circle with a constant diameter to all spots in the image
Easy to implement
The spots need to be of the same shape and size

[Figure: bad example]
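Fixed circle segmentation is simple enough to sketch in a few lines: every spot gets the same circular mask, centred on its addressed position. The function names and the example geometry are illustrative assumptions.

```python
def circle_mask(center, radius, height, width):
    """Set of (row, col) pixels inside a circle of the given radius."""
    cy, cx = center
    return {(r, c)
            for r in range(height) for c in range(width)
            if (r - cy) ** 2 + (c - cx) ** 2 <= radius ** 2}

def fixed_circle_segmentation(centers, radius, height, width):
    """One identical circular spot mask per addressed spot center."""
    return [circle_mask(c, radius, height, width) for c in centers]

# Two hypothetical spot centers in an 11 x 21 pixel image.
masks = fixed_circle_segmentation([(5, 5), (5, 15)], radius=3,
                                  height=11, width=21)
print([len(m) for m in masks])
```

Because the radius never changes, a spot that is smaller, larger, or off-centre ends up with background pixels in its mask or foreground pixels outside it.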

Adaptive circle segmentation

The circle diameter is estimated separately for each spot

Dapple finds spots by detecting edges of spots (second derivative)

Problematic if spots have oval shapes

Adaptive shape segmentation

Specification of starting points or seeds

Regions grow outwards from the seed points preferentially according to the difference between a pixel's value and the running mean of values in an adjoining region.

Histogram segmentation

Uses a target mask chosen to be larger than any other spot

Foreground and background intensity are determined from the histogram of pixel values for pixels within the masked area

Example: QuantArray
  Background: mean between 5th and 20th percentile
  Foreground: mean between 80th and 95th percentile

Unstable when a large target mask is set to compensate for variation in spot size

[Figure: pixel value histogram with background (Bkgd) and foreground regions marked]
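The percentile rule in the QuantArray example can be sketched as follows. The percentile positions are taken as simple indices into the sorted pixel values, which is an assumption of this sketch; the actual software may interpolate differently. Function names and pixel values are hypothetical.

```python
from statistics import mean

def percentile_band_mean(values, low_pct, high_pct):
    """Mean of the sorted values lying between two percentiles."""
    ordered = sorted(values)
    n = len(ordered)
    lo = int(n * low_pct / 100)
    hi = max(lo + 1, int(n * high_pct / 100))
    return mean(ordered[lo:hi])

def histogram_segmentation(mask_pixels):
    """Return (foreground, background) estimates for one target mask."""
    background = percentile_band_mean(mask_pixels, 5, 20)
    foreground = percentile_band_mean(mask_pixels, 80, 95)
    return foreground, background

# Hypothetical target mask: 60 background pixels and 40 spot pixels.
pixels = [100] * 60 + [1000] * 40
fg, bg = histogram_segmentation(pixels)
print(fg, bg)
```

If the mask is made much larger than the spot, the spot pixels can fall outside the 80th to 95th percentile band, which is the instability noted above.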

Information extraction

Spot intensity

The total amount of hybridization for a spot is proportional to the total fluorescence at the spot

Spot intensity = sum of pixel intensities within the spot mask

Since later calculations are based on ratios between cy5 and cy3, we compute the average* pixel value over the spot mask

*alternative : use ratios of medians instead of means

Background intensity

Motivation: a spot's measured intensity includes a contribution of non-specific hybridization and other chemicals on the glass

Fluorescence from regions not occupied by DNA should be different from regions occupied by DNA -> could be interesting to use local negative controls (spotted DNA that should not hybridize)

Different background methods: local background, morphological opening, constant background, no adjustment

Local background

Focuses on small regions surrounding the spot mask and takes the median of pixel values in this region

Most software packages implement such an approach

[Figure panels: ScanAlyze; ImaGene; Spot, GenePix]

By not considering the pixels immediately surrounding the spots, the background estimate is less sensitive to the performance of the segmentation procedure
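A minimal sketch of a local background estimate, assuming a ring-shaped region whose inner radius leaves a gap around the spot. The ring shape, the radii, and the function name are illustrative assumptions; each package defines its own background region.

```python
from statistics import median

def local_background(image, center, inner_radius, outer_radius):
    """Median of the pixels in a ring around the spot center.

    Pixels closer than inner_radius are skipped, so pixels right next
    to the spot do not enter the estimate.
    """
    cy, cx = center
    ring = [image[r][c]
            for r in range(len(image)) for c in range(len(image[0]))
            if inner_radius ** 2
            < (r - cy) ** 2 + (c - cx) ** 2
            <= outer_radius ** 2]
    return median(ring)

# Hypothetical 11 x 11 image: background 50, a bright spot of radius 2.
image = [[900 if (r - 5) ** 2 + (c - 5) ** 2 <= 4 else 50
          for c in range(11)] for r in range(11)]
background = local_background(image, center=(5, 5),
                              inner_radius=3, outer_radius=5)
print(background)
```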

Morphological opening (Spot)

Applied to the original images R and G

Use a square structuring element with side length at least twice as large as the spot separation distance

Remove all the spots and generate an image that is an estimate of the background for the entire slide

For individual spots, the background is estimated by sampling this background image at the nominal center of the spot

Lower background estimate and less variable
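Morphological opening is an erosion (local minimum) followed by a dilation (local maximum) with the same structuring element. The sketch below uses a square window on a nested-list image and clips the window at the image border, which is an implementation choice made here; function names and the toy image are hypothetical.

```python
def _window_filter(image, half, pick):
    """Apply pick (min or max) over a square window around each pixel."""
    h, w = len(image), len(image[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            window = [image[rr][cc]
                      for rr in range(max(0, r - half), min(h, r + half + 1))
                      for cc in range(max(0, c - half), min(w, c + half + 1))]
            row.append(pick(window))
        out.append(row)
    return out

def morphological_opening(image, side):
    """Erosion followed by dilation with a side x side square element."""
    half = side // 2
    eroded = _window_filter(image, half, min)
    return _window_filter(eroded, half, max)

# Hypothetical 9 x 9 image: background 10 with a 3 x 3 spot of 100.
image = [[100 if 3 <= r <= 5 and 3 <= c <= 5 else 10
          for c in range(9)] for r in range(9)]
background_image = morphological_opening(image, side=5)
print(background_image[4][4])  # background estimate at the spot center
```

Because the window is wider than the spot, the erosion removes the spot completely and the dilation restores the surrounding background level.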

Constant background

Global method which subtracts a constant background for all spots

Some findings suggest that the binding of fluorescent dyes to 'negative control spots' is lower than the binding to the glass slide

-> More meaningful to estimate background based on a set of negative control spots

If no negative control spots: approximation of the average background = third percentile of all the spot foreground values

No adjustment

Do not consider the background

Quality measures (-> Flag)

How good are foreground and background measurements?

Variability measures in pixel values within each spot mask
Spot size
Circularity measure
Relative signal to background intensity
b-value: fraction of background intensities less than the median foreground intensity
p-score: extent to which the position of a spot deviates from a rigid rectangular grid

Based on these measurements, one can flag a spot

Summary

The choice of background correction method has a larger impact on the log-intensity ratios than the segmentation method used

The morphological opening method provides a better estimate of background than other methods: low within- and between-slide variability of the log2 R/G

Background adjustment has a larger impact on low-intensity spots

[Figure panels: Spot, GenePix; ScanAlyze. Axes: M = log2(R/G), A = log2 √(R·G)]
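The M and A values used on the plot axes can be computed directly from the background-corrected red and green intensities of a spot. This is a small sketch with hypothetical intensities; both inputs must be positive.

```python
from math import log2, sqrt

def m_a(red, green):
    """Return (M, A) for one spot: log ratio and average log intensity."""
    m = log2(red / green)
    a = log2(sqrt(red * green))
    return m, a

m, a = m_a(red=4000.0, green=1000.0)
print(round(m, 3), round(a, 3))
```

A is the mean of log2 R and log2 G, so plotting M against A shows how the log ratio behaves across the intensity range.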

Selected references

Yang, Y. H., Buckley, M. J., Dudoit, S. and Speed, T. P. (2001), 'Comparisons of methods for image analysis on cDNA microarray data'. Technical report #584, Department of Statistics, University of California, Berkeley. http://www.stat.berkeley.edu/users/terry/zarray/Html/papersindex.html

Yang, Y. H., Buckley, M. J. and Speed, T. P. (2001), 'Analysis of cDNA microarray images'. Briefings in Bioinformatics, 2 (4), 341-349. Excellent review in concise format!

http://pbil.univ-lyon1.fr/library/limma/doc/usersguide.html

Download the limma package and work through the Swirl zebrafish example.




Normalization - two problems

I. How do we detect biases? Which genes should we use for estimating biases among chips/channels?

II. How do we remove the biases?

Why normalize?

Microarray data have significant systematic variation, both within arrays and between arrays, that is not true biological variation

Accurate comparison of genes' relative expression within and across conditions requires normalization of effects

Sources of variation:
  Spatial location on the array
  Dye biases which vary with spot intensity
  Plate origin
  Printing/spotting quality
  Experimenter

Why is normalization important?

Experiment: comparison of gene expression in mouse heart and kidney in response to drug

Source: http://www.partek.com

Most biological effects are swamped by systematic effects!

Other Sources of Systematic Bias

Individual factors
  Print (20% - 30%)
  Experimenter (20% - 30%)
  Organism (3% - 10%)
  Date (5%)
  Software (2%)
  Number of tips (3%)

Interactions
  Print - Experimenter (40%)
  Print - Date (40%)
  Experimenter - Date (40%)

(slide from Catherine Ball)

(based on ~4,600 experiments in Stanford Microarray Database analyzed by ANOVA)

KO #8

Probes: ~6,000 cDNAs, including 200 related to lipid metabolism. Arranged in a 4x4 array of 19x21 sub-arrays.

Clearly visible plate effects

Spatial Biases

Solution: spatial background estimation/subtraction

Spatial plots: background from two slides

Highlighting extreme log ratios

Top (black) and bottom (green) 5% of log ratios

Pin group (sub-array) effects

Boxplots of log ratios by pin group

Lowess lines through points from pin groups

Boxplots and highlighting pin group effects

Clear example of spatial bias

[Figure: log-ratios (y axis) by print-tip group (x axis)]

Time of printing effects

Green channel intensities (log2 G) plotted against spot number. Printing over 4.5 days. The previous slide depicts a slide from this print run.

Normalization in a nutshell

Goal is to measure the ratios of gene expression levels: (ratio)_i = R_i/G_i, where R_i and G_i are, respectively, the measured red and green intensities for the ith spot

In a self-hybridization, we would expect all ratios to be equal to one: R_i/G_i = 1 for all i. But they probably won't be…

Why? Noise (systematic bias) and signal (true differences)

Normalization brings appropriate ratios closer to 1

[Figure: ratio histogram; x axis: ratio (0 to 2), y axis: frequency (0 to 5000)]

The Starting Point: The Ratio (2-color arrays)

[Figure: log(ratio) histogram; x axis: log(ratio) (-2 to 2), y axis: frequency (0 to 3000)]

Log ratios treat up- and down-regulated genes equally

log2(1) = 0, log2(2) = 1, log2(1/2) = -1 (two-color arrays)
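A short check of why log ratios treat induction and repression symmetrically; the ratio values are illustrative.

```python
from math import log2

# Two-fold and four-fold changes in either direction, plus no change.
ratios = [0.25, 0.5, 1, 2, 4]
log_ratios = [log2(r) for r in ratios]
print(log_ratios)
```

On the raw ratio scale all repressed genes are squeezed between 0 and 1 while induced genes range from 1 upward; on the log2 scale a two-fold change is one unit in either direction.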

A note about Affymetrix (1-color) pre-processing

Log transform

Typical Affymetrix probe intensity distribution

After log-transform

Normalization methods

Which genes to use for bias detection?

1. All genes on the chip

Assumption: most of the genes are equally expressed in the compared samples; the proportion of differential genes is low (<20%).

Limits:
  Not appropriate when comparing highly heterogeneous samples (different tissues)
  Not appropriate for analysis of 'dedicated chips' (apoptosis chips, inflammation chips, etc.)

Which Genes to use for bias detection? 2. Housekeeping genes

• Assumption: based on prior knowledge a set of genes can be regarded as equally expressed in the compared samples

• Affy novel chips: ‘normalization set’ of 100 genes

• NHGRI’s cDNA microarrays: 70 "house-keeping" genes set

• Limits:
  The validity of the assumption is questionable
  Housekeeping genes are usually expressed at high levels, so they are not informative for the low-intensity range

Which genes to use for bias detection?

3. Spiked-in controls from another organism, over a range of concentrations

• Limits:
  Low number of controls: less robust
  Can't detect biases due to differences in RNA extraction protocols

4. "Invariant set"

• Trying to identify genes that are expressed at similar levels in the compared samples without relying on any prior knowledge:
  Rank the genes in each chip according to their expression level
  Find genes with small change in ranks
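The two invariant-set steps (rank, then keep genes whose rank barely changes) can be sketched as a single pass. Implementations such as dChip refine the set iteratively; this one-pass version, its function names, the threshold, and the expression values are simplifying assumptions.

```python
def ranks(values):
    """Rank of each value within its chip (0 = lowest expression)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    rank = [0] * len(values)
    for position, index in enumerate(order):
        rank[index] = position
    return rank

def invariant_set(chip_a, chip_b, max_rank_change):
    """Indices of genes whose rank differs by at most max_rank_change."""
    ra, rb = ranks(chip_a), ranks(chip_b)
    return [i for i in range(len(chip_a))
            if abs(ra[i] - rb[i]) <= max_rank_change]

# Hypothetical expression levels for five genes on two chips.
chip_a = [100, 200, 300, 400, 500]
chip_b = [110, 600, 310, 390, 205]
print(invariant_set(chip_a, chip_b, max_rank_change=0))
```

The selected genes then play the role of the reference set from which the normalization curve or factor is estimated.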

1. Global normalization (scaling)

A single normalization factor (k) is computed for balancing chips/channels:

  X_i^norm = k * X_i

or, for 2-color arrays,

  log2 R/G → log2 R/G – c

Multiplying intensities by this factor equalizes the mean (median) intensity among compared chips

Assumption: the total RNA (mass) used is the same for both samples. So, averaged across thousands of genes, total hybridization should be the same for both samples.

Global Normalization (1-color, e.g. Affymetrix)

Before After

X_i^norm = k * X_i

Global Normalization (2-color)

[Figure: histograms of log-ratios, un-normalized and normalized; x axis: log-ratios, y axis: frequency (0 to 700)]

log2 R/G → log2 R/G – c, where c = log2(∑R_i / ∑G_i)
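The two-color global normalization formula can be written out as a short sketch. The function name and the intensity values are hypothetical.

```python
from math import log2

def global_normalize(red, green):
    """Subtract c = log2(sum(R) / sum(G)) from every log ratio."""
    c = log2(sum(red) / sum(green))
    return [log2(r / g) - c for r, g in zip(red, green)]

# Hypothetical array where the red channel is uniformly twice as bright.
red = [200.0, 400.0, 800.0]
green = [100.0, 200.0, 400.0]
normalized = global_normalize(red, green)
print(normalized)
```

A single constant shifts the whole histogram of log ratios so that it is centred near zero, but it leaves the shape of the distribution unchanged.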

2. Intensity-dependent normalization (Yang, Speed)

(Lowess – local linear fit)

Compensate for intensity-dependent biases

Detect intensity-dependent biases: M vs A plots (also called R-I plots)

X axis: A – average intensity, A = 0.5*log(Cy3*Cy5)

Y axis: M – log ratio, M = log(Cy3/Cy5)

Intensity-dependent bias

[Figure: M vs A plot; x axis: A, y axis: M = log(Cy3/Cy5)]

Low intensities: M < 0 (Cy3 < Cy5)

High intensities: M > 0 (Cy3 > Cy5)

* Global normalization cannot remove intensity-dependent biases

We expect the M vs A plot to look like:

[Figure: M vs A plot; x axis: A, y axis: M = log(Cy3/Cy5)]

LOWESS (Locally Weighted Scatterplot Smoothing)

• Local linear regression model

• Tri-cube weight function

• Least Squares

Estimated values of log2(Cy5/Cy3) as function of log10(Cy3*Cy5)
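The three ingredients listed above (local linear regression, tri-cube weights, least squares) are combined in the sketch below. It performs a single smoothing pass and omits the robustness iterations of the full LOWESS procedure, so it is an illustration rather than a replacement for a library implementation. The function name, the default fraction, and the example data are assumptions.

```python
def lowess(x, y, frac=0.5):
    """Single-pass local linear fit with tri-cube weights.

    For each x0, the nearest frac * n points are weighted by
    (1 - (d / h)^3)^3 and a weighted least-squares line is fitted.
    """
    n = len(x)
    k = max(3, int(frac * n))
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12  # bandwidth: k-th nearest distance
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my)
                  for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Hypothetical M vs A data with a purely intensity-dependent bias.
a_values = [float(i) for i in range(6, 16)]
m_values = [0.3 * a - 3.0 for a in a_values]
trend = lowess(a_values, m_values)
normalized = [m - t for m, t in zip(m_values, trend)]
print([round(v, 6) for v in normalized])
```

Normalization subtracts the fitted trend from each M value, so a spot is judged relative to other spots of similar intensity instead of against one global constant.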

A note about Affymetrix (1-color) pre-processing

Two "standard" methods
  MAS 5.0 (now GCOS/GDAS) by Affymetrix (compares PM and MM probes)
  RMA by Speed group (UC Berkeley) (ignores MM probes)

[Diagram labels: within-chip, cross-chip; sequence-specific background correction; within-probe-set aggregation of intensity values]

Normalization – Thoughts

There are many different ways to normalize data
  Global median, LOWESS, LOESS, etc.
  By print tip, spatial, etc.

Choose one wisely, BUT: don't expect it to fix bad data!
  Won't make up for lack of replicates
  Won't make up for horrible slides

For next time...

Read Quackenbush paper on normalization
Look up the paper on Robust Multichip Averaging (RMA) out of Terry Speed's lab
What is meant by least squares?
Visit the Gene Expression Omnibus (GEO) resource at NCBI and explore what is there
If you aren't familiar with the statistical computing environment, R, look it up on the web
Look up MeV (multi-experiment viewer) on the web





Analysis

Microarray Data Flow

[Diagram boxes: Microarray experiment; Image Analysis; Database; Data Selection & Missing value estimation; Normalization & Centering; Data Matrix; Unsupervised Analysis – clustering; Supervised Analysis; Decomposition techniques; Networks & Data Integration]




Interpretation

Microarray data on the Web

Several initiatives to create "unified" databases
  EBI: ArrayExpress
  NCBI: Gene Expression Omnibus

Normalization - tools

Bioconductor (both Affymetrix and cDNA): Packages in R language

dChip (Affymetrix): Quantile, Invariant set

MAANOVA: Microarray ANOVA analysis

Normalization is typically provided in microarray vendor’s software/core facilities but you should always understand the data you’re working with

How has your data been processed? Are there any lingering effects?
