hww gene expression experiments: how? why? what’s the problem?

65
HWW Gene Expression Experiments: How? Why? What’s the problem?

Post on 21-Dec-2015

222 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: HWW Gene Expression Experiments: How? Why? What’s the problem?

HWW Gene Expression Experiments:

How?Why?

What’s the problem?

Page 2: HWW Gene Expression Experiments: How? Why? What’s the problem?

High Throughput Experiments

Bioinformatics

FunctionalGenomics

Page 3: HWW Gene Expression Experiments: How? Why? What’s the problem?

DNA Hybridization

• The principle: have two denatured DNA strands bond together, then check double strand amount (florescent dye, radioactive label)

• “Traditional”: Southern/Northern/Western Blot

• The great advance: micro array DNA chips – automation, material eng., computer aided (including algorithmic solutions)

Page 4: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley, and Division of Genetics and Bioinformatics, The Walter and Eliza Hall Institute of Medical Research,

HistoryHistory

cDNA microarrays have evolved from Southern blots, with clone libraries gridded out on nylon membrane filters being an important and still widely used intermediate. Things took off with the introduction of non-porous solid supports, such as glass - these permitted miniaturization - and fluorescence based detection. Currently, about 20,000 cDNAs can be spotted onto a microscope slide. The other, Affymetrix technology can produce arrays of 100,000 oligonucleotides on a silicon chip.

Page 5: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley, and Division of Genetics and Bioinformatics, The Walter and Eliza Hall Institute of Medical Research,

THE PROCESSTHE PROCESSBuilding the Chip:

MASSIVE PCR PCR PURIFICATION and PREPARATION

PREPARING SLIDES PRINTING

Preparing RNA:

CELL CULTURE AND HARVEST

RNA ISOLATION

cDNA PRODUCTION

Hybing the Chip:POST PROCESSING

ARRAY HYBRIDIZATION

PROBE LABELING

DATA ANALYSIS

Page 6: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley, and Division of Genetics and Bioinformatics, The Walter and Eliza Hall Institute of Medical Research,

MASSIVE PCR PCR PURIFICATION and PREPARATION

PREPARING SLIDES

PRINTING

Building the Chip:

Full yeast genome = 6,500 reactions IPA precipitation +EtOH

washes + 384-well format

The arrayer: high precision spotting device capable of printing 10,000 products in 14 hrs, with a plate change every 25 mins

Polylysine coating for adhering PCR products to glass slides

POST PROCESSING

Chemically converting the positive polylysine surface to prevent non-specific hybridization

Page 7: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley, and Division of Genetics and Bioinformatics, The Walter and Eliza Hall Institute of Medical Research,

Preparing RNA:

CELL CULTURE AND HARVEST

RNA ISOLATION

cDNA PRODUCTION

Designing experiments to profile conditions/perturbations/mutations and carefully controlled growth conditions

RNA yield and purity are determined by system. PolyA isolation is preferable but total RNA is useable. Two RNA samples are hybridized/chip.

Single strand synthesis or amplification of RNA can be performed. cDNA production includes incorporation of Aminoallyl-dUTP.

Page 8: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley, and Division of Genetics and Bioinformatics, The Walter and Eliza Hall Institute of Medical Research,

Hybing the Chip:

ARRAY HYBRIDIZATION

PROBE LABELING

DATA ANALYSIS

Cy3 and Cy5 RNA samples are simultaneously hybridized to chip. Hybs are performed for 5-12 hours and then chips are washed.

Two RNA samples are labelled with Cy3 or Cy5 monofunctional dyes via a chemical coupling to AA-dUTP. Samples are purified using a PCR cleanup kit.

Ratio measurements are determined via quantification of 532 nm and 635 nm emission values. Data are uploaded to the appropriate database where statistical and other analyses can then be performed.

Page 9: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley, and Division of Genetics and Bioinformatics, The Walter and Eliza Hall Institute of Medical Research,

Printing MicroarraysPrinting Microarrays• Print Head• Plate Handling• XYZ positioning

– Repeatability & Accuracy– Resolution

• Environmental Control– Humidity – Dust

• Instrument Control• Sample Tracking Software

Page 10: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley, and Division of Genetics and Bioinformatics, The Walter and Eliza Hall Institute of Medical Research,

Ngai Lab arrayer , UC Berkeley

Page 11: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley, and Division of Genetics and Bioinformatics, The Walter and Eliza Hall Institute of Medical Research,

Microarray GridderMicroarray Gridder

Page 12: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley, and Division of Genetics and Bioinformatics, The Walter and Eliza Hall Institute of Medical Research,

Printing ApproachesPrinting Approaches

Non - Contact • Piezoelectric dispenser• Syringe-solenoid ink-jet dispenser

Contact (using rigid pin tools, similar to filter array)

• Tweezer• Split pin• Micro spotting pin

Page 13: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley, and Division of Genetics and Bioinformatics, The Walter and Eliza Hall Institute of Medical Research,

Micro Spotting pin

Micro Spotting pin

Page 14: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley, and Division of Genetics and Bioinformatics, The Walter and Eliza Hall Institute of Medical Research,

Page 15: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley, and Division of Genetics and Bioinformatics, The Walter and Eliza Hall Institute of Medical Research,

Practical ProblemsPractical Problems

—Surface chemistry: uneven surface may lead to high background.

—Dipping the pin into large volume -> pre-printing to drain off excess sample.

—Spot variation can be due to mechanical difference between pins. Pins could be clogged during the printing process.

—Spot size and density depends on surface and solution properties.

—Pins need good washing between samples to prevent sample carryover.

Page 16: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley, and Division of Genetics and Bioinformatics, The Walter and Eliza Hall Institute of Medical Research,

Post Processing ArraysPost Processing Arrays

Protocol for Post Processing Microarrays

Hydration/Heat Fixing

1. Pick out about 20-30 slides to be processed.

2. Determine the correct orientation of slide, and if necessary, etch label on lower left corner of array side

3. On back of slide, etch two lines above and below center of array to designate array area after processing

4. Pour 100 ml 1X SSC into hydration tray and warm on slide warmer at medium setting

5. Set slide array side down and observe spots until proper hydration is achieved.

6. Upon reaching proper hydration, immediately snap dry slide

7. Place slides in rack.

Page 17: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley , and Division of Genetics and Bioinformatics, Walter and Eliza Hall Institute of Medical Research

Practical Problems 1

• Comet Tails• Likely caused by

insufficiently rapid immersion of the slides in the succinic anhydride blocking solution.

Page 18: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley , and Division of Genetics and Bioinformatics, Walter and Eliza Hall Institute of Medical Research

Practical Problems 2

Page 19: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley , and Division of Genetics and Bioinformatics, Walter and Eliza Hall Institute of Medical Research

Practical Problems 3

High Background• 2 likely causes:

– Insufficient blocking.

– Precipitation of the

labeled probe.

Weak Signals

Page 20: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley , and Division of Genetics and Bioinformatics, Walter and Eliza Hall Institute of Medical Research

Practical Problems 4

Spot overlap:Likely cause: toomuch rehydrationduring post -processing.

Page 21: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley , and Division of Genetics and Bioinformatics, Walter and Eliza Hall Institute of Medical Research

Practical Problems 5

DustDust

Page 22: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley , and Division of Genetics and Bioinformatics, Walter and Eliza Hall Institute of Medical Research

Steps in Images Processing

1. Addressing: locate centers

2. Segmentation: classification of pixels either as signal or background. using seeded region growing).

3. Information extraction: for each spot of the array, calculates signal intensity pairs, background and quality measures.

Page 23: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley , and Division of Genetics and Bioinformatics, Walter and Eliza Hall Institute of Medical Research

Steps in Image Processing

• Spot Intensities– mean (pixel intensities).– median (pixel intensities).

– Pixel variation (IQR of log (pixel

intensities).• Background values

– Local

– Morphological opening

– Constant (global)

– None

• Quality Information

Signal

Background

3. Information Extraction

Page 24: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley , and Division of Genetics and Bioinformatics, Walter and Eliza Hall Institute of Medical Research

Addressing

This is the process of assigning coordinates to each of the spots.

Automating this part of the procedure permits high throughput analysis.

4 by 4 grids19 by 21 spots per grid

Page 25: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley , and Division of Genetics and Bioinformatics, Walter and Eliza Hall Institute of Medical Research

Addressing

Registration

Registration

Page 26: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley , and Division of Genetics and Bioinformatics, Walter and Eliza Hall Institute of Medical Research

Problems in automatic addressing

Misregistration of the red and green channels

Rotation of the array in the image

Skew in the arrayRotation

Rotation

Page 27: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley , and Division of Genetics and Bioinformatics, Walter and Eliza Hall Institute of Medical Research

Segmentation methods• Fixed circles• Adaptive Circle• Adaptive Shape

– Edge detection.– Seeded Region Growing. (R. Adams and L.

Bishof (1994) :Regions grow outwards from the seed points preferentially according to the difference between a pixel’s value and the running mean of values in an adjoining region.

• Histogram Methods– Adaptive threshold.

Page 28: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley , and Division of Genetics and Bioinformatics, Walter and Eliza Hall Institute of Medical Research

Examples of algorithms and software implementation

Methods Software / algorithms

Fixed Circle ScanAlyze, GenePix, QuantArray

Adaptive Circle GenePix

Adaptive Shape Edging and region growing.

Histogram Method QuantArray and adaptivethresholding.

Page 29: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley , and Division of Genetics and Bioinformatics, Walter and Eliza Hall Institute of Medical Research

Limitation of fixed circle method

SRG Fixed Circle

Page 30: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley , and Division of Genetics and Bioinformatics, Walter and Eliza Hall Institute of Medical Research

Limitation of circular segmentation

—Small spot—Not circular

Results from SRG

Page 31: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley , and Division of Genetics and Bioinformatics, Walter and Eliza Hall Institute of Medical Research

Information Extraction

—Spot Intensities—mean (pixel intensities).—median (pixel intensities).

—Background values—Local —Morphological opening—Constant (global)—None

—Quality Information

Take the average

Page 32: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley , and Division of Genetics and Bioinformatics, Walter and Eliza Hall Institute of Medical Research

Local Backgrounds

Page 33: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley, and Division of Genetics and Bioinformatics, The Walter and Eliza Hall Institute of Medical Research,

Summary of analysis possibilitiesSummary of analysis possibilitiesDetermine genes which are differentially expressed (this task

can take many forms depending on replication, etc)

Connect differentially expressed genes to sequence databases and perhaps carry out further analyses, e.g. searching for common upstream motifs

Overlay differentially expressed genes on pathway diagrams

Relate expression levels to other information on cells, e.g. known tumour types

Define subclasses (clusters) in sets of samples (e.g. tumours)

Identify temporal or spatial trends in gene expression

Seek roles for genes on the basis of patterns of co-expression

……..much more

Many challenges: transcriptional regulation involves redundancy, feedback, amplification, .. non-linearity

Page 34: HWW Gene Expression Experiments: How? Why? What’s the problem?

Department of Statistics, University of California, Berkeley, and Division of Genetics and Bioinformatics, The Walter and Eliza Hall Institute of Medical Research,

Biological Question

Sample preparationMicroarray

Life Cycle

Data Analysis & Modeling

Microarray Reaction

MicroarrayDetection

Taken from Schena & Davis

Page 35: HWW Gene Expression Experiments: How? Why? What’s the problem?

Oligonucleotide Arrays

Page 36: HWW Gene Expression Experiments: How? Why? What’s the problem?

Schadt et al., Journal of Cellular Biochemistry, 2000

Page 37: HWW Gene Expression Experiments: How? Why? What’s the problem?

Oligonucleotide Arrays Tech.

• ~20 probes per “gene”, 25bases each*

• Probe size: 24x24 micron (contain ~106 copies of the probe)

• Probe is either a Perfect Match (PP) or a Miss Match (MM)

• MM: – usually at the center of the probe– Aim: to give estimate on the random hybrd.

Page 38: HWW Gene Expression Experiments: How? Why? What’s the problem?

Motivation

• Data is noisy, missing values.• Each array is scanned separately, in different

settings

→ To extract biological meaningful results we need:

1. Good expression estimations

2. Scale/Normalize across arrays

Page 39: HWW Gene Expression Experiments: How? Why? What’s the problem?

What we need

• Image segmentation• Background/Gradient correction• Artifact detection

• Allow array to array comparison (scale/normalize)• Assess gene presence (quantitative “Measure”)• Find differentially expressed genes

Page 40: HWW Gene Expression Experiments: How? Why? What’s the problem?

Why isn’t “Normalization” Easy?

• No ability to read mRNA level directly

• Various noise factors → hard to model exactly.

• Variable biological settings, experiment dependent.

• Need to differentiate between changes caused by biological signal from noise artifacts.

Page 41: HWW Gene Expression Experiments: How? Why? What’s the problem?

Variability Sources

1. Real Biology – 1. Biological noise

2. Biological Signal

2. Sample preparation related

3. Technical dependent

Page 42: HWW Gene Expression Experiments: How? Why? What’s the problem?

dChip MBEI

• Based on several papers by Li & Wong (PNAS, 2001 vol 98 no.1 and others)

• Implemented on their freely available dChip software

• Model based: The estimation is based on a model of how the probe intensity values respond to changes of the expression levels of the gene

Page 43: HWW Gene Expression Experiments: How? Why? What’s the problem?

dChip Model

i is the array indexj is the probe index

is the baseline response of the probe due to non specific hybridization

is the additional rate of increase of the PM response

is the rate of increase of the MM response

Page 44: HWW Gene Expression Experiments: How? Why? What’s the problem?

dChip “Reduced” Model

Basic idea: Least square parameter estimation, iteratively fitting and

Page 45: HWW Gene Expression Experiments: How? Why? What’s the problem?

dChip “Reduced” Model

For one array, assume that the set has been learned from a large number of arrays, and therefore known and fixed

Given this set, the linear least square estimate for theta is

An approx. Std. can be computed for this estimator:

Page 46: HWW Gene Expression Experiments: How? Why? What’s the problem?

dChip “Reduced” Model

• Similarly, we regard the set as known, and compute std. for each phi

• We use these estimated Std. to find outlier and exclude them from the computation:

Page 47: HWW Gene Expression Experiments: How? Why? What’s the problem?

Dchip – Array outliers detection

Page 48: HWW Gene Expression Experiments: How? Why? What’s the problem?

Dchip – Probe outliers detection

Page 49: HWW Gene Expression Experiments: How? Why? What’s the problem?

Normalization/Scaling

• We saw how to get MBEI from dchip, i.e measure “quantitation “

• We still need to scale the different arrays:– Arrays usually differ in overall image

brightness (differ in time, place, exper. Cond….)

• This is usually done PRIOR to the “measure quantitation” manipulations (as dChip’s MBEI we just described).

Page 50: HWW Gene Expression Experiments: How? Why? What’s the problem?

Global – Normalization/Scaling• Suppose we have two arrays X,Y with values

x1…xM and y1 .. yM

• “Global” normalization (MAS 5): find the constant “a” such that

Which means:

When we have multiple arrays then we choose Y to be the avg. of all arrays or compute a such that sum_i (x_i) = constant

Better way: a(x) i.e adopt the fit parameter as a function of expression level ( as by dChip)

Page 51: HWW Gene Expression Experiments: How? Why? What’s the problem?

dChip – Normalization/Scaling

• Big question: Which gene to use for this scaling??

• There are various ways to choose the set:– “House keeping” genes (Affy. chips)– Spiked controls added in various stages of the

experiment, in a range of concentrations– Both of the above are very good in theory but (still)

not in practice (esp. in Affy chips)– The result: several approaches suggested on how to

use the set of genes tested in the experiments

• We’ll review dChip’s solution: The “Invariant set”

Page 52: HWW Gene Expression Experiments: How? Why? What’s the problem?

dChip “Invariant Set”

• Main idea:1. Initialize: set of probes P = all probes 2. Order the probes in both arrays by their expression

values3. Give each probe in each array an index according

to it’s relative expression order4. Find a set of probes P’ who’s relative order is similar

in both arrays 5. Set P = P’ and iterate from stage (2) until

convergence6. Use the resulting P to compute a piecewise linear

running median line as the normalization curve

Page 53: HWW Gene Expression Experiments: How? Why? What’s the problem?
Page 54: HWW Gene Expression Experiments: How? Why? What’s the problem?
Page 55: HWW Gene Expression Experiments: How? Why? What’s the problem?

Normalization Tools – Current State• Commonly Used:

– RMA by Speed Lab– dChip by Li & Wong– GeneChip = MAS5 (Affy. built in tool)

• “The Future”:– New Chip design (both Affy. And cDNA) with

better probes, better built in controls etc.– New algorithms – facilitating probes GC content

(gcRMA), location etc.– New MAS tool (this year ?) is also supposed to

incorporate RMA,dChip etc.

Page 56: HWW Gene Expression Experiments: How? Why? What’s the problem?

How to Measure Performance?

1. Theoretical Validation – use some theoretical assumptions and evaluate Statistical characteristics of the method at hand.

2. Experimental Validation – 1. Use public data sets to measure different aspects of

performance

2. Evaluate relevant characteristics on your data set. Design your data set accordingly (if possible)

Page 57: HWW Gene Expression Experiments: How? Why? What’s the problem?

A Benchmark for Affy. Expression Measures*

• Main Idea: Define a “universal” test set & test statistics

• Based on 3 publicly available spike in data sets• Tests for:

– Variability across replicate arrays– Response of GE measures to change in abundance of

RNA– Sensitivity of fold change measures to amount of

actual RNA sample– Accuracy of fold change as a measure of relative

expression– Usefulness of raw fold change score to detect

differential expressed genes

* Cope et al. Bioinformatics, 03 (Speed’s Lab)

Page 58: HWW Gene Expression Experiments: How? Why? What’s the problem?

MA Plot

M1 = X1 – X2

A = (X1 + X2)/ 2 Where Xi is the log2 of expression measure

Page 59: HWW Gene Expression Experiments: How? Why? What’s the problem?

Variance across replicates plot

Test Statistics: 1. Median std. 2. Avg. R2 (squared corr. coef.) between two replicates

Page 60: HWW Gene Expression Experiments: How? Why? What’s the problem?

Observed Expression vs. Nominal Expression Plots

Test Statistics: Fit a linear curve and compute1. linear fit slope (should be 1) 2. R2 to the linear fit

Page 61: HWW Gene Expression Experiments: How? Why? What’s the problem?

ROC Curves

• One of the chief uses of GE arrays is to identify differentially expressed genes

• ROC ( Receiver Operator Characteristic):A graphical representation of both Sens. and Spec. as a function of threshold value

• X axis: TPR (Sens.) • Y axis: FPR (1-Spec.)• In this case: Use fold change as the score,

knowing which probes are spiked or not..

Page 62: HWW Gene Expression Experiments: How? Why? What’s the problem?

FC ROC PlotsHere actual TP, FP numbers are used for the axes

Test Statistic: AUC (area under the graph)

Page 63: HWW Gene Expression Experiments: How? Why? What’s the problem?

FC ROC PlotsSame as before, but only for FC = 2 cases (harder)

Page 64: HWW Gene Expression Experiments: How? Why? What’s the problem?

The Benchmark – Bottom Line• 15 parameters used to test performace

• 3 “synthetic” spike in data sets

• Automatic submission and evaluation tool + comparative results at:www.biostat.jhsph.edu

Page 65: HWW Gene Expression Experiments: How? Why? What’s the problem?

Other Tests

• Evaluate separately normalization and expression measures techniques ( as by Huffman et al., Genome Biology, Vol. 3, 2002)

• How do we evaluate performance on our own, very specific, data??? ( hint: see next class..)