Multivariate Data Analysis and Visualization Tools for Understanding Biological Data
Dmitry Grapov
Introduction: Systems
Oltvai, et al. Science 25 October 2002: 763-764.
Emergent
Reductionist
Systems
Complex systems
Deterministic
Chemical analysis
Physiology Biochemistry
Graph theory
Modeling
Informatics
Introduction: Inference
Types: Univariate (1-D), Bivariate (2-D), Multivariate (n-D)
Properties: vector, matrix, matrix
Representations: histograms, densities, scatter plots, dendrograms, heatmaps, biplots, networks
Central Idea: mean, correlation, many
http://www.thefullwiki.org/Hypercube
Overview
Univariate: Properties
•vector of length m
–mean
–variance
Univariate: Representations
Univariate: Assumptions
•Normality
Univariate: Utility
Hypothesis testing
•α - type I error (false positive)
•β - type II error (false negative)
•power - (1–β)
•effect size - standardized difference in means
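As a rough illustration of the effect size bullet above, the standardized difference in means can be sketched as Cohen's d with a pooled standard deviation. Cohen's d is one common choice and is assumed here, since the slide does not name a specific estimator; the group names and values are made up.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Standardized difference in means using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Hypothetical measurements for two groups (illustrative values only).
control = [4.0, 5.0, 6.0, 5.0, 5.0]
treated = [6.0, 7.0, 8.0, 7.0, 7.0]
d = cohens_d(treated, control)
print(round(d, 3))
```

A larger absolute d means the group means are further apart relative to the spread within the groups.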
Univariate: Limitations
•Biological definition of the mean?
•Relationship between sample size and test power
•Multiple hypothesis testing
•False discovery rate
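One widely used way to control the false discovery rate is the Benjamini-Hochberg procedure. The slide does not say which method was used, so the sketch below is an assumption, and the p-values are illustrative only.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values from four separate tests.
p_values = [0.01, 0.04, 0.03, 0.20]
q_values = benjamini_hochberg(p_values)
significant = [q <= 0.05 for q in q_values]
```

Note that two of the raw p-values fall below 0.05, but only one test remains significant after adjustment.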
Old Faithful Data
272 observations
•time between eruptions: 70 ± 14 min
•duration of eruption: 3.5 ± 1 min
Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. Applied Statistics 39, 357–365
Bivariate: Properties
•Matrix of 2 vectors of length m
(X,Y)
Bivariate: Representations
(X,Y)
Bivariate: Utility
Variable 2 = m*Variable 1 + b
•bivariate distribution
•correlation
http://en.wikipedia.org/wiki/Correlation
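The linear relationship above can be estimated by ordinary least squares. A minimal sketch, assuming made-up (duration, waiting time) pairs constructed to lie exactly on a line so the result is easy to check; these are not the Old Faithful measurements.

```python
import statistics


def fit_line(x, y):
    """Ordinary least-squares slope m and intercept b for y = m*x + b."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


# Hypothetical pairs generated from waiting = 10 * duration + 36.
duration = [1.8, 2.0, 3.5, 4.0, 4.5]
waiting = [54.0, 56.0, 71.0, 76.0, 81.0]
m, b = fit_line(duration, waiting)
predicted = m * 3.0 + b  # predicted waiting time for a 3-minute eruption
```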
Bivariate: Limitations
Correlation coefficient
•Measure of linear or monotonic relationship
•Sensitive to outliers
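A small sketch of the outlier point: Pearson correlation (linear) drops when one extreme value is added, while Spearman correlation (monotonic, computed on ranks) does not. The data are invented for illustration, and the rank helper assumes no tied values.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks from 1 to n; this sketch assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_clean = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
y_outlier = [2.0, 4.0, 6.0, 8.0, 10.0, 100.0]  # last value is an extreme point

r_clean = pearson(x, y_clean)
r_outlier = pearson(x, y_outlier)
rho_outlier = spearman(x, y_outlier)
```

The outlier here keeps the ordering intact, which is why the rank-based measure is unaffected; an outlier that breaks the ordering would change both coefficients.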
Old Faithful
Old Unfaithful?
Additional variables
•Nearby hydrofracking
•Improve inference based on more information
Challenges
•data are often wide (more variables than samples)
•integration
•noise
Rewards
•robust inference
•signal amplification
•holistic/systems approach
Multivariate: Properties
•A matrix of n vectors of length m
Correlation matrix
Multivariate: Dimensional Reduction
Principal Components Analysis (PCA)
Linear n-dimensional encoding of the original data, where the dimensions are:
1. orthogonal (uncorrelated)
2. top k dimensions are ordered by variance explained
Figure axes: PC 1, PC 2
Wall, Michael E., Andreas Rechtsteiner, Luis M. Rocha. "Singular value decomposition and principal component analysis". In A Practical Approach to Microarray Data Analysis. D.P. Berrar, W. Dubitzky, M. Granzow, eds. pp. 91-109, Kluwer: Norwell, MA (2003). LANL LA-UR-02-4001.
Calculating PCs: singular value decomposition (SVD) of the original data
Eigenvalues (PC x PC)
•explained variance
Scores (m x PC)
•sample representation based on all variables
Loadings (n x PC)
•variable contribution to scores
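A minimal sketch of PCA computed by SVD, assuming NumPy and randomly generated data (the variable names and the data are not from the slides). It shows how scores, loadings, and explained variance could be obtained from the decomposition of centered and scaled data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: m = 50 samples, n = 3 variables; the first two are correlated.
m = 50
base = rng.normal(size=m)
X = np.column_stack([base, 2.0 * base + 0.1 * rng.normal(size=m), rng.normal(size=m)])

# Center each variable and scale it to unit variance before decomposition.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Thin SVD: Z = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(Z, full_matrices=False)

scores = U * s                   # m x PC: sample coordinates on each component
loadings = Vt.T                  # n x PC: variable contributions to each component
eigenvalues = s**2 / (m - 1)     # variance along each component
explained = eigenvalues / eigenvalues.sum()  # fraction of variance explained
```

Multiplying the scores by the transposed loadings reconstructs the scaled data, and the components come out ordered from most to least variance explained.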
Old Faithful 2.0
•272 measurements
•8 variables
•2 real, 6 random noise
Multivariate: Representations
•Identify outliers using all measurements
•Use known values to impute missing values
•Identify interesting groups
•Evaluate uni- and bivariate observations
•The number of PCs can be used to estimate true data complexity
PCA: Considerations
•data pre-treatment
•outliers
•noise
•unsupervised projection
no pre-treatment
centered and scaled to unit variance
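Centering and scaling to unit variance (often called autoscaling) could be sketched as follows. The variable names and values are hypothetical; the point is that variables on very different scales become comparable before PCA.

```python
import statistics


def autoscale(column):
    """Center a variable at zero and scale it to unit variance."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)  # sample standard deviation
    return [(value - mean) / sd for value in column]


# Hypothetical variables measured on very different scales.
waiting_min = [54.0, 62.0, 70.0, 78.0, 86.0]
duration_min = [1.8, 2.6, 3.4, 4.2, 5.0]

scaled_waiting = autoscale(waiting_min)
scaled_duration = autoscale(duration_min)
```

Without this step, the variable with the largest raw variance tends to dominate the first principal component.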
PCA: Considerations
•linear reconstruction
•Independent components analysis (ICA): use ICA to calculate statistically independent components
PCA: Considerations
•Non-negative matrix factorization (NMF): uses an additive, parts-based encoding
Learning the parts of objects by nonnegative matrix factorization, D.D. Lee, H.S. Seung, Zhipeng Zhao, ppt.
PCA: Considerations
•supervised projection
•Identify a projection correlated with class assignment (classification) or continuous variables (regression)
•Partial Least Squares Projection to Latent Structures (PLS/-DA)
PLS/-DA: Utility
Strengths
•Predict multiple dependent variables
•Avoids issues of multicollinearity
•Independent measure of variable importance
Weaknesses
•Need to derive an empirical reference for model performance
•Poorly established model optimization methods
PLS-DA: Example
•Data: Old Faithful 2.0
•272 observations on 8 variables
•Latent Variables (LVs) are analogous to PCs
•Important statistics (cross-validated, CV)
•Q2 = fit
•RMSEP = error of prediction
•AU(RO)C = specificity vs. sensitivity
Select the appropriate number of Latent Variables (LVs) to maximize Q2
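The three statistics above could be computed from cross-validated predictions roughly as follows. This sketch assumes the common definitions Q2 = 1 - PRESS/TSS, RMSEP as the root mean squared prediction error, and AUC as the probability that a positive sample scores above a negative one; the slides do not give formulas, and the labels and predictions are made up.

```python
import math


def q2(observed, predicted):
    """Q2 = 1 - PRESS / TSS, computed from cross-validated predictions."""
    mean_obs = sum(observed) / len(observed)
    press = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    tss = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - press / tss


def rmsep(observed, predicted):
    """Root mean squared error of prediction."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    positives = [s for s, lab in zip(scores, labels) if lab == 1]
    negatives = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical class labels and held-out predictions.
labels = [0, 0, 0, 1, 1, 1]
predicted = [0.1, 0.4, 0.6, 0.5, 0.8, 0.9]

q2_value = q2(labels, predicted)
rmsep_value = rmsep(labels, predicted)
auc_value = auc(labels, predicted)
```

In a model-selection loop, these values would be recomputed for each candidate number of LVs, keeping the number that gives the highest Q2.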
PLS-DA: Performance
•Use permutation tests to empirically determine model performance
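The idea of a permutation test can be sketched without a PLS-DA model: shuffle the class labels many times and ask how often a shuffled labeling scores as well as the real one. Here a simple difference in class means stands in for model performance (an assumption for brevity; in practice the statistic would be something like Q2 from a refitted model), and the data are invented.

```python
import random


def mean_difference(values, labels):
    """Stand-in performance statistic: absolute difference in class means."""
    a = [v for v, lab in zip(values, labels) if lab == 1]
    b = [v for v, lab in zip(values, labels) if lab == 0]
    return abs(sum(a) / len(a) - sum(b) / len(b))


def permutation_p_value(values, labels, statistic, n_permutations=999, seed=0):
    """Fraction of label shufflings that score at least as well as the real labels."""
    rng = random.Random(seed)
    observed = statistic(values, labels)
    shuffled = list(labels)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if statistic(values, shuffled) >= observed:
            at_least_as_good += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (at_least_as_good + 1) / (n_permutations + 1)


# Hypothetical measurements for two well-separated classes.
values = [5.1, 4.9, 5.3, 5.0, 5.2, 7.0, 7.2, 6.9, 7.1, 7.3]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
p_value = permutation_p_value(values, labels, mean_difference)
```

The distribution of the statistic under shuffled labels serves as the empirical reference against which the real model is judged.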
PLS: Predictive Performance
•Split data into training (2/3) and test sets (1/3)
•Generate model using training set and then predict class assignment for test set
•Use permutation tests to generate confidence bounds for future predictions
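A minimal sketch of the 2/3 training and 1/3 test split, assuming a simple random split by sample index (the slides do not say whether the split was stratified by class). The function name and seed are illustrative.

```python
import random


def split_train_test(n_samples, test_fraction=1 / 3, seed=0):
    """Shuffle sample indices and hold out a fraction as the test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


# 272 observations, as in the Old Faithful example.
train_idx, test_idx = split_train_test(272)
```

The model would be fitted only on the rows in train_idx and then asked to predict the class of the rows in test_idx.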
PLS: Feature Selection
Use PLS-DA as an objective function to identify the most informative variables
Networks
Network: representation of relationships among objects
Utility
•Project statistical results into a biological context
•Explore informative data aspects in the context of all that was observed.
•Identify emergent patterns
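One simple way to build such a network, assumed here for illustration and not taken from the slides, is to connect variables whose pairwise correlation exceeds a threshold. The variable names, values, and the 0.8 cutoff are all hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_edges(variables, threshold=0.8):
    """Edges between variables whose absolute correlation meets the threshold."""
    edges = []
    for a, b in combinations(sorted(variables), 2):
        r = pearson(variables[a], variables[b])
        if abs(r) >= threshold:
            edges.append((a, b, round(r, 3)))
    return edges


# Made-up measurements for four variables across five samples.
variables = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.1, 3.9, 6.2, 8.0, 9.8],  # rises with A
    "C": [5.0, 4.1, 2.9, 2.0, 1.1],  # falls as A rises
    "D": [3.0, 1.0, 4.0, 1.0, 5.0],  # only weakly related to the others
}
edges = correlation_edges(variables)
```

Each edge carries the sign of the correlation, so positive and negative relationships can be drawn differently; variable D ends up unconnected.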
Networks
•Interpret statistical results within a biological context
•Highlight changes in patterns of relationships (non-diabetics vs. type 2 diabetics)
•Display complex interactions (non-diabetics vs. type 2 diabetics)
Figure: T2D and UCP3 OPLS-DA loadings. x-axis: increasing in T2D; y-axis: increasing in g/a UCP3. Labeled variables: 1-LG; 12(13)-EpODE; 12,13-DiHOME; 15(S)-HEPE; 17,18-DiHETE; 18:1n9; 22:5n6; 5-HETE; 9-HETE; 9(10)-EpOME; 9,10-DiHOME; 9,12,13-TriHOME; AEA; c16:0; c18:0; DHEA; LEA; NO-Gly; SEA. Groups: g/a, g/g; T2D, non-T2D. Panels: non-diabetics, type 2 diabetics.
imDEV: interactive modules for Data Exploration and Visualization
An integrated environment for systems level analysis of multivariate data.
http://sourceforge.net/apps/mediawiki/imdev
Acknowledgements
Newman Lab
Designated Emphasis in Biotechnology (DEB)
NIH
This project is funded in part by the NIH grant NIGMS-NIH T32-GM008799, USDA-ARS 5306-51530-019-00D, and NIH-NIDDK R01DK078328-01.