Microarray analysis
Quantitation of Gene Expression
Expression Data to Networks
BIO520 Bioinformatics Jim Lund
Reading: Ch 16
Microarray data
• Image quantitation
• Normalization
• Find genes with significant expression differences
• Annotation
• Clustering, pattern analysis, network analysis
Sources of Non-Biological Variation
• Dye bias: differences in heat and light sensitivity, efficiency of dye incorporation
• Differences in the amount of labeled cDNA hybridized to each channel in a microarray experiment (Channel is used to refer to a combination of a dye and a slide.)
• Variation across replicate slides
• Variation across hybridization conditions
• Variation in scanning conditions
• Variation among technicians doing the lab work.
Factors that affect the signal level
• Amount of mRNA
• Labeling efficiencies
• Quality of the RNA
• Laser/dye combination
• Detection efficiency of photomultiplier or CCD
[Figure: Hela vs. HepG2]
M vs. A Plot
A = (Log Green + Log Red) / 2
M = Log Red - Log Green
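A minimal sketch of how the M and A values for such a plot could be computed from paired channel intensities. The slide does not state the log base; base 2 is assumed here, and the function name and intensity values are made up for illustration.

```python
import math

def ma_values(red, green):
    """Return (M, A) lists for paired red/green intensities.

    Assumption: logs are base 2 (the slide does not specify a base).
    """
    m, a = [], []
    for r, g in zip(red, green):
        log_r, log_g = math.log2(r), math.log2(g)
        m.append(log_r - log_g)        # M: log ratio of the two channels
        a.append((log_g + log_r) / 2)  # A: average log intensity
    return m, a

# Invented intensities for three spots.
red = [8.0, 4.0, 16.0]
green = [2.0, 4.0, 4.0]
m, a = ma_values(red, green)
```

A spot with equal signal in both channels gets M = 0, so a well-normalized chip pair should scatter around the M = 0 line across the whole range of A.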
M v A plots of chip pairs: before normalization
M v A plots of chip pairs: after quantile normalization
Types of normalization
• To total signal (linear normalization)
• LOESS (LOcally WEighted polynomial regreSSion)
• To “housekeeping genes”
• To genomic DNA spots (Research Genetics) or mixed cDNAs
• To internal spikes
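As a rough illustration, the sketch below shows linear normalization to total signal from the list above, together with the quantile normalization named in the M v A plot caption. Function names and data are invented for the example, and ties are broken by position, which is a simplification.

```python
def scale_to_total(channel, target_total):
    """Linear normalization: rescale so the channel sums to target_total."""
    factor = target_total / sum(channel)
    return [x * factor for x in channel]

def quantile_normalize(chips):
    """Give every chip the same distribution of values.

    chips: list of equal-length lists, one list per chip.
    The k-th smallest value on each chip is replaced by the mean of the
    k-th smallest values across all chips.
    """
    n = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    rank_means = [sum(s[k] for s in sorted_chips) / len(chips) for k in range(n)]
    result = []
    for chip in chips:
        order = sorted(range(n), key=lambda i: chip[i])  # indices by rank
        normalized = [0.0] * n
        for rank, idx in enumerate(order):
            normalized[idx] = rank_means[rank]
        result.append(normalized)
    return result

# Invented data: two chips, three probes each.
chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(chips)
```

Linear scaling only corrects an overall intensity difference; quantile normalization also removes intensity-dependent differences, at the cost of assuming the chips share one underlying distribution.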
Microarray analysis
• Data exploration: expression of gene X?
• Statistical analysis: which genes show large, reproducible changes?
• Clustering: grouping genes by expression pattern.
• Knowledge-based analysis: Are amine synthesis genes involved in this experiment?
Fold change: the crudest method of finding differentially expressed genes
[Figure: Hela vs. HepG2, regions of >2-fold expression change]
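The fold-change filter can be sketched in a few lines. The gene names and signal values below are invented, and the function name is hypothetical; only the >2-fold cutoff comes from the slide.

```python
import math

def fold_change_hits(sample_a, sample_b, threshold=2.0):
    """Return genes whose signal differs by more than threshold-fold.

    sample_a, sample_b: dicts mapping gene name -> normalized signal.
    Working on the log ratio treats up- and down-regulation symmetrically.
    """
    cutoff = math.log2(threshold)
    hits = []
    for gene in sample_a:
        log_ratio = math.log2(sample_b[gene] / sample_a[gene])
        if abs(log_ratio) > cutoff:
            hits.append(gene)
    return hits

# Invented signals for three genes in the two cell lines.
hela = {"geneA": 100.0, "geneB": 100.0, "geneC": 400.0}
hepg2 = {"geneA": 450.0, "geneB": 150.0, "geneC": 100.0}
hits = fold_change_hits(hela, hepg2)  # geneA (4.5-fold up), geneC (4-fold down)
```

Note that this uses no replicate information at all, which is why the slide calls it the crudest method.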
What do we mean by differentially expressed?
• Statistically, our gene is different from the other genes.
[Figure: distribution of average ratios for all genes (number of genes vs. log ratio)]
[Figure: distribution of measurements for gene of interest (probability of a given value of the ratio)]
Finding differentially expressed genes
What affects our certainty that a gene is up or down-regulated?
• Number of sample points
• Difference in means
• Standard deviations of the samples
[Figure: probe signal for Sample A and Sample B]
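All three factors appear directly in a t statistic. The sketch below uses Welch's unequal-variance form, which is a choice made for this example rather than something the slides specify; the probe signals are invented.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """t statistic for the difference in means of two samples.

    Grows with the difference in means and the number of sample points,
    shrinks with the sample standard deviations.
    """
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error

# Invented probe signals: same means and spread, different replicate counts.
t_three = welch_t([10.0, 12.0, 14.0], [4.0, 6.0, 8.0])
t_six = welch_t([10.0, 12.0, 14.0] * 2, [4.0, 6.0, 8.0] * 2)
```

With the same means and spread, doubling the number of replicates gives a larger t, i.e. more certainty that the gene is really up- or down-regulated.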
Practical views on statistics
• With appropriate biological replicates, it is possible to select statistically meaningful genes/patterns.
• Sensitivity and selectivity are inversely related - e.g. increased selection of true positives WILL result in more false positives and fewer false negatives.
• False negatives are lost opportunities; false positives cost money and waste time.
• Even with conservative statistics, a typical set of experiments yields more genes/pathways/patterns than one can sensibly follow - so use conservative statistics to protect against false positives when designing follow-on experiments.
Statistical Tests
• Student’s t-test
– Correct for multiple testing! (Holm-Bonferroni)
• False discovery rate.
• Significance Analysis of Microarrays (SAM)
– http://www-stat.stanford.edu/~tibs/SAM/
• ANOVA
• Principal components analysis
• Special methods for periodic patterns in data.
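To make the multiple-testing point concrete, here is a small sketch of the Holm-Bonferroni correction next to a false discovery rate procedure. Benjamini-Hochberg is assumed as the FDR method (the slide does not name one), and the p-values are invented.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Step-down Holm-Bonferroni: list of booleans, True = rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # stop at the first p-value that fails its threshold
    return reject

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up Benjamini-Hochberg FDR control: True = rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank  # remember the largest rank that passes
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            reject[idx] = True
    return reject

# Invented p-values for four genes.
p_values = [0.01, 0.04, 0.03, 0.005]
holm = holm_bonferroni(p_values)
fdr = benjamini_hochberg(p_values)
```

On these four p-values Holm-Bonferroni keeps two genes while the FDR procedure keeps all four, which illustrates the sensitivity/selectivity trade-off discussed earlier.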
Volcano plot: log(expr) vs p-value
[Figure: volcano plot; x-axis: Log(fold change), y-axis: p-value]
Scatter plot showing genes with significant p-values
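A sketch of how the coordinates for a volcano plot might be prepared. Plotting -log10(p) on the vertical axis is a common convention and is assumed here; the slide only labels the axis "p-value". The function name and values are invented.

```python
import math

def volcano_points(fold_changes, p_values):
    """Return (log2 fold change, -log10 p) pairs, one per gene.

    Assumption: p-values are shown as -log10(p), so that the most
    significant genes sit at the top of the plot.
    """
    return [(math.log2(fc), -math.log10(p))
            for fc, p in zip(fold_changes, p_values)]

# Invented values: a 4-fold increase with p = 0.001, a 2-fold decrease with p = 0.1.
points = volcano_points([4.0, 0.5], [0.001, 0.1])
```

Genes of interest are those far from zero horizontally (large change) and high vertically (small p-value).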
Pattern finding
• In many cases, the patterns of differential expression are the target (as opposed to specific genes)
– Clustering or other approaches for pattern identification - find genes which behave similarly across all experiments or experiments which behave similarly across all genes
– Classification - identify genes which best distinguish 2 or more classes.
• The statistical reliability of the pattern or classifier is still an issue and similar considerations apply - e.g. cluster analysis of random noise will still produce clusters, and they will be meaningless.
What is clustering?
• Group similar objects together.
– Genes with similar expression patterns.
• Objects in the same cluster (group) are more similar to each other than objects in different clusters.
Clustering
• What is clustering?
• Similarity/distance metrics
• Hierarchical clustering algorithms
– Made popular by Stanford, e.g. [Eisen et al. 1998]
• K-means
– Made popular by many groups, e.g. [Tavazoie et al. 1999]
• Self-organizing map (SOM)
– Made popular by Whitehead, e.g. [Tamayo et al. 1999]
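Of the three algorithms listed, K-means is the shortest to sketch. The version below is a plain Lloyd's-algorithm illustration with Euclidean distance and caller-supplied starting centroids (real tools usually pick them at random); it is not the implementation used in any of the cited papers, and the expression profiles are invented.

```python
import math

def kmeans(profiles, centroids, iterations=10):
    """Cluster expression profiles around the given starting centroids.

    profiles: list of equal-length lists (one expression profile per gene).
    Returns (labels, centroids).
    """
    labels = []
    for _ in range(iterations):
        # Assignment step: each gene joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in profiles
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(profiles, labels) if lab == k]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[k])  # empty cluster: leave as is
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Invented two-experiment profiles for four genes.
profiles = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]]
labels, centers = kmeans(profiles, [[0.0, 0.0], [6.0, 6.0]])
```

The number of clusters has to be chosen in advance, which is one practical difference from hierarchical clustering.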
Typical Tools
• SAM (Significance Analysis of Microarrays), Stanford
• GeneSpring
• Affymetrix GeneChip Operating System (GCOS)
• Cluster/Treeview
• R statistics package microarray analysis libraries
How to define similarity?
• Similarity metric:
– A measure of pairwise similarity or dissimilarity
– Examples: correlation coefficient, Euclidean distance
[Figure: raw matrix (n genes × p experiments) converted to a similarity matrix (n genes × n genes); X and Y are two genes]
Similarity metrics
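• Euclidean distance
$$ d(X, Y) = \sqrt{\sum_{j=1}^{p} \left( X[j] - Y[j] \right)^2} $$
• Correlation coefficient
$$ r(X, Y) = \frac{\sum_{j=1}^{p} (X[j] - \bar{X})(Y[j] - \bar{Y})}{\sqrt{\sum_{j=1}^{p} (X[j] - \bar{X})^2 \, \sum_{j=1}^{p} (Y[j] - \bar{Y})^2}}, \quad \text{where } \bar{X} = \frac{1}{p} \sum_{j=1}^{p} X[j] $$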
Euclidean clustering = magnitude & direction
Correlation clustering = direction
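The difference between the two metrics shows up clearly on two profiles that have the same shape but different magnitude. This is a small sketch with invented expression values for genes X and Y.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

# Invented profiles: Y follows the same pattern as X at 10x the level.
gene_x = [1.0, 2.0, 3.0, 4.0]
gene_y = [10.0, 20.0, 30.0, 40.0]
r = pearson(gene_x, gene_y)    # correlation of 1: same direction
d = euclidean(gene_x, gene_y)  # large distance: different magnitude
```

Correlation would place these two genes in the same cluster; Euclidean distance would keep them apart.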
Sporulation-example
Self-organizing maps (SOM) [Kohonen 1995]
• Basic idea:
– map high dimensional data onto a 2D grid of nodes
– Neighboring nodes are more similar than points far away
Self-organizing maps (SOM)
SOM Clusters
Things learned from microarray gene expression experiments
• Pathways not known to be involved
– Ontology?
• Novel genes involved in a known pathway
• “like” and “unlike” tissues
Transcription Factors / Regulatory Networks
• Identify co-regulated genes
• Search for common motifs (transcription factor binding sites)
– Evaluate known motifs/factors
– Search for new ones.
• Programs: MEME, etc.
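The "evaluate known motifs" step can be illustrated with a simple exact-match count over the promoters of a co-regulated cluster. This is only a sketch: the sequences, gene names, and motif are invented, and discovering new motifs (as MEME does) requires a statistical model that is not shown here.

```python
def count_motif(promoters, motif):
    """Count exact, possibly overlapping, motif matches per promoter.

    promoters: dict mapping gene name -> promoter sequence.
    """
    motif = motif.upper()
    width = len(motif)
    counts = {}
    for gene, sequence in promoters.items():
        sequence = sequence.upper()
        counts[gene] = sum(
            1
            for start in range(len(sequence) - width + 1)
            if sequence[start:start + width] == motif
        )
    return counts

# Invented promoter fragments and motif.
promoters = {"gene1": "ACGTGACGTG", "gene2": "TTTTTTTT"}
counts = count_motif(promoters, "CGTG")
```

A motif that appears in most promoters of a cluster but rarely elsewhere in the genome is a candidate binding site for a shared regulator.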
mRNA-protein Correlation
• YPD: should have relevant data
– will yeast be typical?
• Electrophoresis 18:533
– 23 proteins on 2D gels
– r = 0.48 for mRNA vs. protein
• Post-transcriptional and post-translational regulation are important!
Other microarray formats
• Single nucleotide polymorphism (SNP) chips
– Oligos with each of 4 nt at each SNP.
• Chromosomal IP chips (ChIP:chip)
– Determine transcription factor binding sites
– Promoter DNA on the chip.
• Alternative splicing chips
– Long oligos, covering alternatively spliced exons, or all exons.
• Genome tiling chips
ChIP:chip - Identification of Transcription Factor Binding Sites
• Cross link transcription factors to DNA with formaldehyde
• Pull out the transcription factor of interest via immunoprecipitation with an antibody, or by tagging the factor of interest with an isolatable epitope (e.g. GST fusion).
• Fractionate the DNA associated with the transcription factor, reverse the cross links, label and hybridize to an array of promoter DNA.
• Brown et al. (2001) Nature 409:533-8
ChIP:chip Analysis of TF Binding Sites
On to Proteomics
DNA → RNA → Protein