multivariate data analysis and visualization tools for biological data

Multivariate Data Analysis and Visualization

Tools for Understanding Biological Data

Dmitry Grapov

Introduction: Systems

Oltvai, et al. Science 25 October 2002: 763-764.

Emergent

Reductionist

Systems

Complex systems

Deterministic

Chemical analysis

Physiology Biochemistry

Graph theory

Modeling

Informatics

Introduction: Inference

Types: Univariate (1-D), Bivariate (2-D), Multivariate (n-D)

Properties: vector, matrix, matrix

Representations: histograms, densities, scatter plots, dendrograms, heatmaps, biplots, networks

Central Idea: mean, correlation, many

http://www.thefullwiki.org/Hypercube

Overview

Univariate: Properties

•vector of length m

– mean

– variance

Univariate: Representations

Univariate: Assumptions

•Normality

Univariate: Utility

Hypothesis testing

•α - type I error (false positive)

•β - type II error (false negative)

•power - (1–β)

•effect size - standardized difference in means
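As a rough illustration of how these quantities relate, the sketch below computes a standardized effect size (Cohen's d with a pooled standard deviation) and an approximate power for a two-sided two-sample comparison. It uses a normal approximation rather than the exact t distribution, and the function names and example numbers are hypothetical, not taken from the slides.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def cohens_d(a, b):
    """Standardized difference in means using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) of a two-sided two-sample test.

    Assumption: normal (z) approximation with equal group sizes.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(d) * sqrt(n_per_group / 2)  # expected test statistic under the alternative
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


# Made-up measurements for two groups (illustration only).
group_a = [5.1, 4.9, 5.6, 5.8, 5.2, 5.4]
group_b = [4.2, 4.8, 4.5, 4.1, 4.9, 4.4]
d = cohens_d(group_a, group_b)

# Same effect size, different sample sizes: power grows with n.
power_small = approx_power(0.5, n_per_group=10)
power_large = approx_power(0.5, n_per_group=100)
```

With no true effect (d = 0) the function returns α, which is the type I error rate by construction.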

Univariate: Limitations

•Biological definition of the mean?

•Relationship between sample size and test power

•Multiple hypothesis testing

– False discovery rate
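One widely used way to control the false discovery rate is the Benjamini-Hochberg step-up procedure. The slides do not say which correction was used, so the sketch below is an assumption chosen for illustration, and the p-values are made up.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans, True where the hypothesis is rejected at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k (1-based) whose sorted p-value satisfies p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank

    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from ten univariate tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_values, fdr=0.05)
naive_hits = sum(p < 0.05 for p in p_values)
```

In this example five tests pass an unadjusted 0.05 threshold, but only the two smallest p-values survive the correction.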

Old Faithful Data

272 observations

•time between eruptions: 70 ± 14 min

•duration of eruption: 3.5 ± 1 min

Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. Applied Statistics 39, 357–365

Bivariate: Properties

•Matrix of 2 vectors of length m

(X,Y)

Bivariate: Representations

(X,Y)

Bivariate: Utility

Variable 2 = m*Variable 1 + b

•bivariate distribution

•correlation

http://en.wikipedia.org/wiki/Correlation

Bivariate: Limitations

Correlation coefficient

•Measure of linear or monotonic relationship

http://en.wikipedia.org/wiki/Correlation

Bivariate: Limitations

•Sensitive to outliers
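The two limitations above can be seen in a small sketch. Pearson's coefficient measures linear association, Spearman's (Pearson on ranks) measures monotonic association, and a single outlier can change the Pearson value drastically. The data are made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient (linear association)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation (monotonic association)."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.2, 7.9]
r_clean = pearson(x, y)

# Monotonic but non-linear relationship: Spearman is 1, Pearson is below 1.
cubes = [v ** 3 for v in x]
r_cubes = pearson(x, cubes)
rho_cubes = spearman(x, cubes)

# One extreme outlier flips the sign of Pearson's r; Spearman is less affected.
x_out = x + [9]
y_out = y + [-20.0]
r_outlier = pearson(x_out, y_out)
rho_outlier = spearman(x_out, y_out)
```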

Old Faithful


Old Unfaithful?


Additional variables

•Nearby hydrofracking

•Improve inference based on more information

Challenges

•data are often wide (more variables than samples)

•integration

•noise

Rewards

•robust inference

•signal amplification

•holistic/systems approach

Multivariate: Properties

•A matrix of n vectors of length m

Correlation matrix

Principal Components Analysis (PCA)

Linear n-dimensional encoding of the original data, where the dimensions are:

1. orthogonal (uncorrelated)

2. ordered by variance explained (top k dimensions)

[Figure: axes PC 1 and PC 2]

Multivariate: Dimensional Reduction

Wall, Michael E., Andreas Rechtsteiner, Luis M. Rocha. "Singular value decomposition and principal component analysis." In A Practical Approach to Microarray Data Analysis. D.P. Berrar, W. Dubitzky, M. Granzow, eds. pp. 91-109, Kluwer: Norwell, MA (2003). LANL LA-UR-02-4001.

Calculating PCs: singular value decomposition (SVD)

Original Data is decomposed into:

•Scores (m x PC)

•Explained variance (PC x PC)

•Loadings (n x PC)

Eigenvalue

•explained variance

Scores

•sample representation based on all variables

Loadings

•variable contribution to scores
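A minimal sketch of PCA computed through the SVD, assuming NumPy is available. The data are simulated (two correlated variables plus three noise variables), and the convention used here takes scores as U scaled by the singular values. Other texts label U alone as the scores, so treat the naming as an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: m = 100 samples, n = 5 variables;
# two correlated "real" variables plus three pure-noise variables.
m = 100
signal = rng.normal(size=m)
X = np.column_stack([
    signal + 0.1 * rng.normal(size=m),
    2.0 * signal + 0.1 * rng.normal(size=m),
    rng.normal(size=(m, 3)),
])

# Pre-treatment: center each variable and scale it to unit variance.
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Thin SVD: Xs = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(Xs, full_matrices=False)

scores = U * s                    # m x PC: sample representation based on all variables
loadings = Vt.T                   # n x PC: variable contribution to each PC
eigenvalues = s ** 2 / (m - 1)    # variance along each PC
explained = eigenvalues / eigenvalues.sum()  # fraction of variance explained
```

Multiplying `scores` by the transposed `loadings` reconstructs the pre-treated data, and keeping only the first k columns of each gives the rank-k approximation.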

Multivariate: Dimensional Reduction

Old Faithful 2.0

•272 measurements

•8 variables

•2 real, 6 random noise

Multivariate: Representations

•A matrix of n vectors of length m

Multivariate: Representation

•Identify outliers using all measurements

•Use known values to impute missing values

•Identify interesting groups

•Evaluate uni- and bivariate observations

•Number of PCs can be used to estimate true data complexity

PCA: Considerations

•data pre-treatment

•outliers

•noise

•unsupervised projection

no pre-treatment

centered and scaled to unit variance
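The effect of pre-treatment can be sketched with simulated data, assuming NumPy is available. Without pre-treatment the first component simply follows the variable with the largest magnitude. After centering and scaling to unit variance it follows the correlated pair instead. Variable names and scales are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 200

# Hypothetical variables on very different scales (assumed for illustration).
big = rng.normal(loc=1000.0, scale=100.0, size=m)   # large mean and variance
small_a = rng.normal(size=m)
small_b = small_a + 0.2 * rng.normal(size=m)         # correlated with small_a
X = np.column_stack([big, small_a, small_b])


def first_component(data):
    """Loadings of the first component (first right singular vector) of the data as given."""
    _, _, vt = np.linalg.svd(data, full_matrices=False)
    return vt[0]


# No pre-treatment: the large-scale variable dominates.
raw_pc1 = first_component(X)

# Centered and scaled to unit variance: the correlated pair dominates.
autoscaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
scaled_pc1 = first_component(autoscaled)
```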

PCA: Considerations


•outliers

•linear reconstruction

•noise

•Independent components analysis (ICA)

•unsupervised projection

Use ICA to calculate statistically independent components

PCA: Considerations


•outliers


•noise

•supervised projection

•Non-negative matrix factorization (NMF)

NMF uses additive parts based encoding

Learning the parts of objects by nonnegative matrix factorization, D.D. Lee, H.S. Seung, Zhipeng Zhao, ppt.

PCA: Considerations


•outliers


•noise

•supervised projection

•Identify projection correlated with class assignment (classification) or continuous variables (regression)

•Partial Least Squares Projection to Latent Structures (PLS/-DA)

PLS/-DA: Utility

Strengths

•Predict multiple dependent variables

•avoids issues of multicollinearity

•Independent measure of variable importance

Weaknesses

•Need to derive an empirical reference for model performance

•Poorly established model optimization methods

PLS-DA: Example

•Data: Old Faithful 2.0

•272 observations on 8 variables

•Latent Variables are analogous to PCs

•Important Statistics (CV)

•Q2 = fit

•RMSEP = error of prediction

•AU(RO)C = specificity vs. sensitivity

Select the appropriate number of Latent Variables (LVs) to maximize Q2
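A common definition is Q2 = 1 - PRESS/TSS, where PRESS is the sum of squared cross-validated prediction errors and TSS is the total sum of squares around the mean, with RMSEP = sqrt(PRESS/n). Exact definitions vary between software packages, so the sketch below is one assumed variant. It uses a one-variable least-squares line with leave-one-out cross-validation as a simple stand-in for a PLS model, and the data are made up.

```python
from math import sqrt


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx


def loo_predictions(x, y):
    """Leave-one-out cross-validated predictions (stand-in model, not PLS)."""
    preds = []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_train, y_train)
        preds.append(slope * x[i] + intercept)
    return preds


def q2_and_rmsep(y, y_cv):
    """Q2 = 1 - PRESS / TSS and RMSEP = sqrt(PRESS / n) from cross-validated predictions."""
    press = sum((a - b) ** 2 for a, b in zip(y, y_cv))
    mean_y = sum(y) / len(y)
    tss = sum((a - mean_y) ** 2 for a in y)
    return 1 - press / tss, sqrt(press / len(y))


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
q2, rmsep = q2_and_rmsep(y, loo_predictions(x, y))
```

For a real PLS model the same two statistics would be computed for each candidate number of LVs, and the number giving the highest Q2 would be kept.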

PLS-DA: Performance

•Use permutation tests to empirically determine model performance

PLS: Predictive Performance

•Split data into training (2/3) and test sets (1/3)

•Generate model using training set and then predict class assignment for test set

•Use permutation tests to generate confidence bounds for future predictions
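The steps above can be sketched as follows. A nearest-centroid classifier is swapped in as a simple stand-in for PLS-DA so that the example needs only the standard library. The data, class labels, and function names are hypothetical.

```python
import random


def nearest_centroid_fit(X, y):
    """Compute one centroid per class (stand-in for fitting a PLS-DA model)."""
    centroids = {}
    for label in set(y):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def nearest_centroid_predict(centroids, X):
    """Assign each row to the class with the closest centroid."""
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [min(centroids, key=lambda lab: dist2(row, centroids[lab])) for row in X]


def accuracy(y_true, y_pred):
    return sum(a == b for a, b in zip(y_true, y_pred)) / len(y_true)


def split_fit_score(X, y, rng):
    """Split into training (2/3) and test (1/3) sets; return test-set accuracy."""
    idx = list(range(len(y)))
    rng.shuffle(idx)
    cut = (2 * len(idx)) // 3
    train, test = idx[:cut], idx[cut:]
    model = nearest_centroid_fit([X[i] for i in train], [y[i] for i in train])
    preds = nearest_centroid_predict(model, [X[i] for i in test])
    return accuracy([y[i] for i in test], preds)


rng = random.Random(0)

# Hypothetical two-class data: one informative variable, one noise variable.
y = [0] * 45 + [1] * 45
X = [[rng.gauss(3.0 * label, 1.0), rng.gauss(0.0, 1.0)] for label in y]

observed = split_fit_score(X, y, rng)

# Null distribution: repeat the whole procedure with shuffled class labels.
null_scores = []
for _ in range(200):
    y_perm = y[:]
    rng.shuffle(y_perm)
    null_scores.append(split_fit_score(X, y_perm, rng))

# Empirical p-value: how often a permuted model does at least as well.
p_value = (1 + sum(s >= observed for s in null_scores)) / (1 + len(null_scores))
```

The spread of `null_scores` gives an empirical reference for what test-set performance looks like when there is no real class structure.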

PLS: Predictive Performance

PLS: Feature Selection

Use PLS-DA as an objective function to identify the most informative variables

Networks

Network: representation of relationships among objects

Utility

•Project statistical results into a biological context

•Explore informative data aspects in the context of all that was observed.

•Identify emergent patterns
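The slides do not state how their networks were constructed. One simple possibility, shown here purely as an assumption, is a correlation network: variables are nodes, and an edge connects any pair whose absolute correlation exceeds a threshold. The variable names and values are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_network(variables, threshold=0.8):
    """Return edges (name_a, name_b, r) for pairs with |r| >= threshold."""
    names = sorted(variables)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(variables[a], variables[b])
            if abs(r) >= threshold:
                edges.append((a, b, round(r, 3)))
    return edges


# Hypothetical measurements (made up for illustration).
variables = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "B": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1],   # tracks A
    "C": [6.0, 5.1, 3.9, 3.0, 2.1, 0.9],    # inversely tracks A
    "D": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],    # unrelated to the others
}
edges = correlation_network(variables)
```

The sign of each retained correlation can be kept as an edge attribute, so that positive and negative relationships are drawn differently.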

Networks

•Interpret statistical results within a biological context

Networks

•Highlight changes in patterns of relationships.

non-diabetics type 2 diabetics

Networks

•Display complex interactions


[Figure: T2D and UCP3 OPLS-DA Loadings. Axes: increasing in T2D; increasing in g/a UCP3. Groups: g/a, g/g; T2D, non-T2D. Labeled variables: 1-LG, 12(13)-EpODE, 12,13-DiHOME, 15(S)-HEPE, 17,18-DiHETE, 18:1n9, 22:5n6, 5-HETE, 9-HETE, 9(10)-EpOME, 9,10-DiHOME, 9,12,13-TriHOME, AEA, c16:0, c18:0, DHEA, LEA, NO-Gly, SEA.]


imDEV: interactive modules for Data Exploration and Visualization

An integrated environment for systems level analysis of multivariate data.

http://sourceforge.net/apps/mediawiki/imdev



Acknowledgements

Newman Lab

Designated Emphasis in Biotechnology (DEB)

NIH

This project is funded in part by NIH grant NIGMS-NIH T32-GM008799, USDA-ARS 5306-51530-019-00D, and NIH-NIDDK R01DK078328-01.
