Microarray analysis

Quantitation of Gene Expression

Expression Data to Networks

BIO520 Bioinformatics Jim Lund

Reading: Ch 16

Microarray data

• Image quantitation

• Normalization

• Find genes with significant expression differences

• Annotation

• Clustering, pattern analysis, network analysis

Sources of Non-Biological Variation

• Dye bias: differences in heat and light sensitivity, efficiency of dye incorporation

• Differences in the amount of labeled cDNA hybridized to each channel in a microarray experiment (Channel is used to refer to a combination of a dye and a slide.)

• Variation across replicate slides

• Variation across hybridization conditions

• Variation in scanning conditions

• Variation among technicians doing the lab work.

Factors that affect the signal level

• Amount of mRNA

• Labeling efficiencies

• Quality of the RNA

• Laser/dye combination

• Detection efficiency of photomultiplier or CCD

M vs. A Plot

[Figure: M vs. A plot, HeLa vs. HepG2]

M = Log Red - Log Green

A = (Log Green + Log Red) / 2
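The M and A values on this slide can be computed per spot from the two channel intensities. The following is a minimal sketch, assuming base-2 logarithms (the slide only says "Log") and made-up intensities; the function name `ma_values` is illustrative.

```python
import math


def ma_values(red, green):
    """Return (M, A) lists for paired red/green spot intensities.

    Assumption: base-2 logs, so M = 1 means a 2-fold red/green ratio.
    """
    m_vals, a_vals = [], []
    for r, g in zip(red, green):
        log_r, log_g = math.log2(r), math.log2(g)
        m_vals.append(log_r - log_g)        # M: log ratio of the two channels
        a_vals.append((log_g + log_r) / 2)  # A: average log intensity
    return m_vals, a_vals


# Hypothetical intensities for three spots.
red = [1024.0, 512.0, 256.0]
green = [256.0, 512.0, 1024.0]
m, a = ma_values(red, green)
print(m, a)
```

Plotting M against A puts unchanged genes near M = 0 at every intensity, which makes intensity-dependent dye bias easy to see.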

M v A plots of chip pairs: before normalization

M v A plots of chip pairs: after quantile normalization
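Quantile normalization forces every chip to share the same intensity distribution. A minimal sketch of the idea follows, assuming no tied values within a chip (real implementations handle ties) and made-up data; `quantile_normalize` is an illustrative name, not a library function.

```python
def quantile_normalize(chips):
    """chips: list of equal-length lists, one list of intensities per chip."""
    n_genes = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Reference distribution: mean of the k-th smallest value across chips.
    reference = [
        sum(s[k] for s in sorted_chips) / len(chips) for k in range(n_genes)
    ]
    normalized = []
    for chip in chips:
        # order[k] is the index of the gene with the k-th smallest value.
        order = sorted(range(n_genes), key=lambda i: chip[i])
        out = [0.0] * n_genes
        for rank, gene_index in enumerate(order):
            out[gene_index] = reference[rank]  # replace value by rank mean
        normalized.append(out)
    return normalized


# Hypothetical intensities: two chips, four genes each.
chip_a = [5.0, 2.0, 3.0, 4.0]
chip_b = [4.0, 1.0, 6.0, 2.0]
norm = quantile_normalize([chip_a, chip_b])
print(norm)
```

After normalization both chips contain exactly the same set of values; only the assignment of values to genes differs.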

Types of normalization

• To total signal (linear normalization)

• LOESS (LOcally WEighted polynomial regreSSion)

• To “housekeeping genes”

• To genomic DNA spots (Research Genetics) or mixed cDNAs

• To internal spikes
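The simplest entry in this list, normalization to total signal, rescales one channel by a single factor so that both channels have the same summed intensity. A minimal sketch with made-up intensities follows; `scale_to_total` is an illustrative helper.

```python
def scale_to_total(channel, target_total):
    """Linearly rescale a channel so its intensities sum to target_total."""
    factor = target_total / sum(channel)
    return [x * factor for x in channel]


# Hypothetical intensities: green is uniformly half as bright as red.
red = [100.0, 200.0, 300.0, 400.0]
green = [50.0, 100.0, 150.0, 200.0]
green_scaled = scale_to_total(green, sum(red))
print(green_scaled)
```

A single scale factor cannot correct bias that changes with intensity, which is the case that LOESS normalization is meant to handle.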

Microarray analysis

• Data exploration: expression of gene X?

• Statistical analysis: which genes show large, reproducible changes?

• Clustering: grouping genes by expression pattern.

• Knowledge-based analysis: Are amine synthesis genes involved in this experiment?

Fold change: the crudest method of finding differentially expressed genes

[Figure: HeLa vs. HepG2 scatter plot with the two regions of >2-fold expression change marked]
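A fold-change filter can be sketched in a few lines. The example below assumes already-normalized intensities and uses hypothetical gene names and values; it calls a gene "up" or "down" only when the change is strictly greater than 2-fold.

```python
import math


def fold_change_calls(sample_a, sample_b, threshold=2.0):
    """Return {gene: "up" or "down"} for genes changing more than threshold-fold.

    sample_a, sample_b: dicts mapping gene name to normalized intensity.
    """
    cutoff = math.log2(threshold)
    calls = {}
    for gene in sample_a:
        log_ratio = math.log2(sample_b[gene] / sample_a[gene])
        if log_ratio > cutoff:
            calls[gene] = "up"
        elif log_ratio < -cutoff:
            calls[gene] = "down"
    return calls


# Hypothetical intensities for three genes in two cell lines.
hela = {"geneA": 100.0, "geneB": 100.0, "geneC": 400.0}
hepg2 = {"geneA": 450.0, "geneB": 150.0, "geneC": 100.0}
calls = fold_change_calls(hela, hepg2)
print(calls)
```

The filter ignores replicate variability entirely, which is why the slide calls it the crudest method.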

What do we mean by differentially expressed?

• Statistically, our gene is different from the other genes.

[Figure: number of genes vs. log ratio, showing the distribution of average ratios for all genes]

[Figure: probability of a given value of the ratio, showing the distribution of measurements for the gene of interest]

Finding differentially expressed genes

What affects our certainty that a gene is up or down-regulated?

• Number of sample points

• Difference in means

• Standard deviations of the samples

[Figure: probe signal distributions for Sample A and Sample B]
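These three factors are exactly what a t statistic combines. The sketch below uses Welch's unequal-variance form (an assumption chosen for illustration) with made-up probe signals; it computes only the statistic, not a p-value.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two lists of replicate measurements."""
    mean_diff = statistics.mean(sample_b) - statistics.mean(sample_a)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1)
    var_b = statistics.variance(sample_b)
    # Standard error shrinks with more replicates and with smaller variance.
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return mean_diff / standard_error


# Hypothetical log signals: same difference in means (2.0), different spread.
tight = welch_t([10.0, 10.5, 9.5], [12.0, 12.5, 11.5])
noisy = welch_t([10.0, 13.0, 7.0], [12.0, 15.0, 9.0])
print(tight, noisy)
```

With the same difference in means, the tightly replicated gene gets a much larger t statistic than the noisy one.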

Practical views on statistics

• With appropriate biological replicates, it is possible to select statistically meaningful genes/patterns.

• Sensitivity and selectivity are inversely related - e.g. selecting more true positives WILL bring in more false positives while leaving fewer false negatives.

• False negatives are lost opportunities; false positives cost money and waste time.

• A typical set of experiments, even when treated with conservative statistics, yields more genes/pathways/patterns than one can sensibly follow - so use conservative statistics to protect against false positives when designing follow-on experiments.

Statistical Tests

• Student’s t-test

– Correct for multiple testing! (Holm-Bonferroni)

• False discovery rate.

• Significance Analysis of Microarrays (SAM)

– http://www-stat.stanford.edu/~tibs/SAM/

• ANOVA

• Principal components analysis

• Special methods for periodic patterns in data.

http://info.med.yale.edu/cooley/research%20focus.html
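The Holm-Bonferroni correction named in the list of statistical tests can be sketched as a step-down procedure over sorted p-values. The example assumes per-gene p-values have already been computed (the values here are made up); `holm_bonferroni` is an illustrative name.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is tested at alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break  # step-down: once one test fails, all larger p-values fail
    return rejected


# Hypothetical p-values for four genes.
p_values = [0.001, 0.04, 0.03, 0.012]
significant = holm_bonferroni(p_values)
print(significant)
```

All four p-values are below 0.05 uncorrected, but only two survive the correction, which illustrates why multiple testing matters when thousands of genes are tested at once.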

Volcano plot: log(expr) vs p-value

[Figure: volcano plot; axes: Log(fold change) and p-value]

Scatter plot showing genes with significant p-values

Pattern finding

• In many cases, the patterns of differential expression are the target (as opposed to specific genes)

– Clustering or other approaches for pattern identification - find genes which behave similarly across all experiments or experiments which behave similarly across all genes

– Classification - identify genes which best distinguish 2 or more classes.

• The statistical reliability of the pattern or classifier is still an issue and similar considerations apply - e.g. cluster analysis of random noise will still produce clusters, but they will be meaningless.

What is clustering?

• Group similar objects together.

– Genes with similar expression patterns.

• Objects in the same cluster (group) are more similar to each other than objects in different clusters.

Clustering

• What is clustering?

• Similarity/distance metrics

• Hierarchical clustering algorithms

– Made popular by Stanford, e.g. [Eisen et al. 1998]

• K-means

– Made popular by many groups, e.g. [Tavazoie et al. 1999]

• Self-organizing map (SOM)

– Made popular by Whitehead, e.g. [Tamayo et al. 1999]
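Of the algorithms listed, K-means is the shortest to sketch. The example below is a bare-bones version that alternates assignment and centroid update, assuming Euclidean distance, made-up expression profiles, and hand-picked starting centroids (real tools usually pick them at random). It needs Python 3.8+ for `math.dist`.

```python
import math


def kmeans(profiles, centroids, iterations=10):
    """Cluster expression profiles around k centroids.

    profiles: list of expression vectors; centroids: k initial vectors.
    Returns (labels, centroids).
    """
    for _ in range(iterations):
        # Assignment step: each profile joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in profiles
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster as-is
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Hypothetical profiles: two rising genes and two falling genes.
profiles = [[1.0, 2.0, 3.0], [1.1, 2.1, 2.9], [3.0, 2.0, 1.0], [2.9, 2.2, 1.1]]
labels, centers = kmeans(profiles, [profiles[0], profiles[2]])
print(labels)
```

The number of clusters k has to be chosen in advance, unlike hierarchical clustering, where the tree can be cut at any level afterwards.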

Typical Tools

• SAM (Significance Analysis of Microarrays), Stanford

• GeneSpring

• Affymetrix GeneChip Operating System (GCOS)

• Cluster/Treeview

• R statistics package microarray analysis libraries.

How to define similarity?

• Similarity metric:

– A measure of pairwise similarity or dissimilarity

– Examples:

• Correlation coefficient

• Euclidean distance

[Figure: the raw matrix (n genes by p experiments) is converted into a similarity matrix (n genes by n genes); X and Y mark two genes being compared]

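Similarity metrics

• Euclidean distance

\[ d(X, Y) = \sqrt{\sum_{j=1}^{p} (X[j] - Y[j])^2} \]

• Correlation coefficient

\[ r(X, Y) = \frac{\sum_{j=1}^{p} (X[j] - \bar{X})(Y[j] - \bar{Y})}{\sqrt{\sum_{j=1}^{p} (X[j] - \bar{X})^2 \, \sum_{j=1}^{p} (Y[j] - \bar{Y})^2}}, \quad \text{where } \bar{X} = \frac{1}{p} \sum_{j=1}^{p} X[j] \]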

Euclidean clustering = magnitude & direction

Correlation clustering = direction
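The difference between the two metrics shows up clearly on a small example. The sketch below implements Euclidean distance and the Pearson correlation coefficient over p experiments, using three made-up gene profiles.

```python
import math


def euclidean_distance(x, y):
    """Square root of the summed squared differences over all experiments."""
    return math.sqrt(sum((xj - yj) ** 2 for xj, yj in zip(x, y)))


def correlation(x, y):
    """Pearson correlation coefficient of two expression profiles."""
    p = len(x)
    x_bar, y_bar = sum(x) / p, sum(y) / p
    numerator = sum((xj - x_bar) * (yj - y_bar) for xj, yj in zip(x, y))
    denominator = math.sqrt(
        sum((xj - x_bar) ** 2 for xj in x) * sum((yj - y_bar) ** 2 for yj in y)
    )
    return numerator / denominator


# Hypothetical profiles over four experiments.
gene_x = [1.0, 2.0, 3.0, 4.0]
gene_y = [10.0, 20.0, 30.0, 40.0]  # same direction as X, 10x the magnitude
gene_z = [4.0, 3.0, 2.0, 1.0]      # similar magnitude to X, opposite direction

print(euclidean_distance(gene_x, gene_y), euclidean_distance(gene_x, gene_z))
print(correlation(gene_x, gene_y), correlation(gene_x, gene_z))
```

Euclidean distance puts X closer to Z because their magnitudes match, while correlation pairs X with Y because they rise together and treats Z as perfectly anti-correlated.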

Sporulation-example

Self-organizing maps (SOM) [Kohonen 1995]

• Basic idea:

– map high dimensional data onto a 2D grid of nodes

– Neighboring nodes are more similar than points far away

Self-organizing maps (SOM)

SOM Clusters

Things learned from microarray gene expression experiments

• Pathways not known to be involved

–Ontology?

• Novel genes involved in a known pathway

• “like” and “unlike” tissues

Transcription Factors and Regulatory Networks

• Identify co-regulated genes

• Search for common motifs (transcription factor binding sites)

–Evaluate known motifs/factors

–Search for new ones.

• Programs: MEME, etc.
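Evaluating a known motif across a cluster of co-regulated genes can be as simple as counting exact matches in each promoter. The sketch below uses a hypothetical motif and made-up promoter sequences and scans one strand only. It is not how MEME works: MEME discovers new motifs, while this only checks a motif you already have.

```python
def motif_hits(promoters, motif):
    """Count exact, possibly overlapping, motif matches in each promoter.

    promoters: dict mapping gene name to promoter sequence (one strand).
    """
    hits = {}
    for gene, seq in promoters.items():
        count = 0
        start = seq.find(motif)
        while start != -1:
            count += 1
            start = seq.find(motif, start + 1)  # allow overlapping matches
        hits[gene] = count
    return hits


# Hypothetical promoter sequences for three co-clustered genes.
promoters = {
    "gene1": "TTGACGTCAAAGACGTCA",
    "gene2": "CCCGACGTCTT",
    "gene3": "ATATATATAT",
}
hits = motif_hits(promoters, "GACGTC")
print(hits)
```

A motif that appears in most promoters of a cluster but rarely in the rest of the genome is a candidate binding site for a shared regulator.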

mRNA-protein Correlation

• YPD: should have relevant data

– will yeast be typical?

• Electrophoresis 18:533

– 23 proteins on 2D gels

– r = 0.48 for mRNA vs. protein

• Post transcriptional and post translational regulation important!

Other microarray formats

• Single nucleotide polymorphism (SNP) chips

– Oligos with each of 4 nt at each SNP.

• Chromatin IP chips (ChIP:chip)

– Determine transcription factor binding sites

– Promoter DNA on the chip.

• Alternative splicing chips

– Long oligos, covering alternatively spliced exons, or all exons.

• Genome tiling chips

ChIP:chip--Identification of Transcription Factor Binding Sites

• Cross link transcription factors to DNA with formaldehyde

• Pull out transcription factor of interest via immunoprecipitation with an antibody or by tagging the factor of interest with an isolatable epitope (e.g. GST fusion).

• Fractionate the DNA associated with the transcription factor, reverse the cross links, label and hybridize to an array of promoter DNA.

• Brown et al. (2001) Nature 409:533-8

ChIP:chip Analysis of TF Binding Sites

On to Proteomics

DNA → RNA → Protein
