mev: joe white
Analysis of Multiple Experiments
TIGR Multiple Experiment Viewer (MeV)
Joseph White, DFCI, January 24, 2008
MeV
• Stand-alone Java application for analysis
• New version: 4.1
• Not database centric; uses TDMS files
• Writes TDMS files
• Primarily for normalized data
• MeV does not currently write MAGE-TAB
• Download MeV from: tm4.org
Outline
• Description of MeV
• How MeV treats expression
• Some essential concepts
• Demo: basic operations in MeV
– New file loader
– ANOVA example
• Demo of MeV new features
– Affymetrix file reader
– Non-parametric tests
– CGH
• GCOD
The Expression Matrix is a representation of data from multiple microarray experiments.
Each element is a log ratio (usually log2(Cy5 / Cy3)).
Red indicates a positive log ratio, i.e., Cy5 > Cy3
Green indicates a negative log ratio, i.e., Cy5 < Cy3
Black indicates a log ratio of zero, i.e., Cy5 and Cy3 are very close in value
[Figure: expression matrix heat map; rows Gene 1 to Gene 6, columns Exp 1 to Exp 6]
Gray indicates missing data
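The color scheme above can be sketched in a few lines of Python. This is only an illustration, not MeV's code: the function names, the `tolerance` cutoff, and the intensity values are all hypothetical, and a real heat map would use a color gradient rather than four discrete colors.

```python
import math

def log_ratio(cy5, cy3):
    """Return log2(Cy5/Cy3), or None if an intensity is missing or non-positive."""
    if cy5 is None or cy3 is None or cy5 <= 0 or cy3 <= 0:
        return None
    return math.log2(cy5 / cy3)

def display_color(ratio, tolerance=0.05):
    """Map a log ratio to a color name.

    `tolerance` is an illustrative cutoff for "close to zero",
    not an actual MeV setting.
    """
    if ratio is None:
        return "gray"    # missing data
    if abs(ratio) <= tolerance:
        return "black"   # Cy5 and Cy3 are very close in value
    return "red" if ratio > 0 else "green"

# Hypothetical (Cy5, Cy3) intensity pairs for one gene in four experiments.
spots = [(2000.0, 500.0), (500.0, 2000.0), (1000.0, 1000.0), (None, 800.0)]
ratios = [log_ratio(cy5, cy3) for cy5, cy3 in spots]
colors = [display_color(r) for r in ratios]

print(ratios)  # [2.0, -2.0, 0.0, None]
print(colors)  # ['red', 'green', 'black', 'gray']
```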
Expression Vectors
- Gene expression vectors encapsulate the expression of a gene over a set of experimental conditions or sample types.
[Figure: example expression vector (-0.8, 0.8, 1.5, 1.8, 0.5, -1.3, -0.4, 1.5) plotted over conditions 1 to 8; y-axis: log2(Cy5/Cy3), range -2 to 2]
Expression Vectors As Points in ‘Expression Space’
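[Figure: genes G1 to G5 plotted as points in a space with axes Experiment 1, Experiment 2, and Experiment 3; a group of nearby points is labeled "Similar Expression". Table alongside: log ratios for G1 to G5 across Exp 1 to Exp 3. Extracted cell values, row and column order not recoverable: -0.8, -0.6, 0.9, 1.2, -0.3, 1.3, -0.7, -0.4, -0.4, -0.8, -0.8, -0.7, 1.3, 0.9, -0.6]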
Distance and Similarity
- The ability to calculate a distance (or similarity, its inverse) between two expression vectors is fundamental to clustering algorithms
- Distance between vectors is the basis upon which decisions are made when grouping similar patterns of expression
- Selection of a distance metric defines the concept of distance
Distance: a measure of similarity between genes.
        Exp 1  Exp 2  Exp 3  Exp 4  Exp 5  Exp 6
Gene A  x1A    x2A    x3A    x4A    x5A    x6A
Gene B  x1B    x2B    x3B    x4B    x5B    x6B
Some distances (MeV provides 11 metrics):
1. Euclidean: $\sqrt{\sum_{i=1}^{6} (x_{iA} - x_{iB})^2}$
2. Manhattan: $\sum_{i=1}^{6} |x_{iA} - x_{iB}|$
3. Pearson correlation
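As a minimal sketch of the three metrics listed above, the following Python functions compute each one for two six-element expression vectors. The function names and the values for Gene A and Gene B are made up for illustration; this is not how MeV implements its metrics.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson(a, b):
    """Pearson correlation coefficient r between two vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical log ratios for two genes over six experiments.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gene_b = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

print("Euclidean:", euclidean(gene_a, gene_b))
print("Manhattan:", manhattan(gene_a, gene_b))
print("Pearson r:", pearson(gene_a, gene_b))
```

Gene B here is Gene A shifted up by 1, so the Euclidean and Manhattan distances are non-zero while the correlation is perfect.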
[Figure: two points, p0 and p1]
Distance is Defined by a Metric
[Figure: two pairs of expression profiles; y-axis: log2(Cy5/Cy3), range -2 to 2]
Distance Metric:  Euclidean  Pearson (r * -1)
D (first pair)    4.2        -1.00
D (second pair)   1.4        -0.90
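The effect of the metric choice can be sketched with three hypothetical profiles. The values below are invented and do not reproduce the numbers on the slide. Pearson distance is taken as r * -1, following the slide's label.

```python
import math

def pearson_distance(a, b):
    """Pearson correlation multiplied by -1, so -1.0 means identical shape."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return -cov / math.sqrt(var_a * var_b)

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical log-ratio profiles over four conditions.
p = [-1.0, 0.0, 1.0, 0.0]
q = [x + 2.0 for x in p]      # same shape as p, shifted up by 2
s = [-0.5, 0.5, 0.5, -0.5]    # close to p in value, different shape

for name, other in (("q", q), ("s", s)):
    print(name,
          "Euclidean:", round(euclidean(p, other), 2),
          "Pearson (r * -1):", round(pearson_distance(p, other), 2))
```

By Euclidean distance `s` is the nearer neighbor of `p`. By Pearson distance `q` is, because it has exactly the same shape.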
Normal distribution
X = μ (mean of the distribution)
σ = standard deviation of the distribution
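For reference, the density of a normal distribution with mean μ and standard deviation σ is usually written as below. The slide shows only the curve, so the formula is given here as standard background; the symbol x for the measured value is an added notation.

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```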
Current MeV Algorithms
• Hierarchical Clustering
• K-Means Clustering
• Support Trees for HCL
• EASE (annotation clustering)
• Self-organizing maps
• K-Nearest Neighbors
• Support Vector Machines
• Relevance Networks
• Template Matching
• PCA
• CGH
• Bayesian Networks
• T-test
• ANOVA
– One and two factor
• SAM
• Non-parametric tests
– Wilcoxon
– Fisher Exact Test
– Mack-Skillings
– Kruskal-Wallis
• BRIDGE
Demos
• File loaders
• HTA data: ANOVA
• Affymetrix data: SAM
• Non-Parametric tests
• CGH
GeneChip Oncology Database
Breast Cancer (10%)
CNS Tumor (14%)
Head and Neck (12%)
Leukemia (28%)
Lung Cancer (6%)
Prostate Cancer (10%)
Ovarian Cancer (4%)
Other (16%)
GeneChip Oncology Database
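[Figure: bar chart of the number of chips by array platform (HG-U133A, HG_U95A(v2), Hu6800, Other); y-axis: Number of Chips, 0 to 1600]
[Figure: bar chart of the number of studies by number of chips per study (> 200, 100 ~ 200, 50 ~ 100, 20 ~ 50, < 20); y-axis: Number of Studies, 0 to 25]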
GCOD statistics
• Studies: 52
• Hybridizations: 4,591
• Analysis result sets: 12,637
• Signal values: 204,296,195
• Samples: 3,644
• Probesets: 160,817
– e.g., HG-U133A: 22,293; HG_U133_Plus_2: 54,684
• Array designs: 9
• Accessions: 54,414