multivariate data analysis and visualization tools for biological data

Multivariate Data Analysis and Visualization Tools for Understanding Biological Data Dmitry Grapov

Upload: dmitry-grapov

Post on 10-May-2015

TRANSCRIPT

Page 1: Multivariate data analysis and visualization tools for biological data

Multivariate Data Analysis and Visualization

Tools for Understanding Biological Data Dmitry Grapov

Page 2: Multivariate data analysis and visualization tools for biological data

Introduction: Systems

Oltvai, et al. Science 25 October 2002: 763-764.

Emergent

Reductionist

Systems

Complex systems

Deterministic

Chemical analysis

Physiology Biochemistry

Graph theory

Modeling

Informatics

Page 3: Multivariate data analysis and visualization tools for biological data

Introduction: Inference

Page 4: Multivariate data analysis and visualization tools for biological data

Types: Univariate (1-D), Bivariate (2-D), Multivariate (n-D)

Properties: vector, matrix, matrix

Representations: histograms, densities, scatter plots, dendrograms, heatmaps, biplots, networks

Central Idea: mean, correlation, many

http://www.thefullwiki.org/Hypercube

Overview

Page 5: Multivariate data analysis and visualization tools for biological data

Univariate: Properties

•vector of length m

–mean

–variance

Page 6: Multivariate data analysis and visualization tools for biological data

Univariate: Representations

Page 7: Multivariate data analysis and visualization tools for biological data

Univariate: Assumptions

•Normality

Page 8: Multivariate data analysis and visualization tools for biological data

Univariate: Utility

Hypothesis testing

•α - type I error (false positive)

•β - type II error (false negative)

•power - (1–β)

•effect size - standardized difference in mean
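
As a rough illustration of how α, power, and effect size relate, the sketch below computes Cohen's d (one common standardized difference in means) and approximates the power of a two-sided, two-sample test with a normal approximation. The function names, the made-up measurements, and the equal-group-size assumption are illustrative and do not come from the slides.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def cohens_d(a, b):
    """Standardized difference in means using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def approx_power(d, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided two-sample test (equal group sizes)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)      # rejection threshold set by the type I error rate
    shift = abs(d) * sqrt(n_per_group / 2)  # expected test statistic under the alternative
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Made-up measurements for two groups (assumption, for illustration only).
control = [4.1, 5.0, 4.6, 5.3, 4.8, 4.4]
treated = [5.2, 5.9, 5.5, 6.3, 5.1, 5.8]
d = cohens_d(treated, control)
for n in (6, 12, 24):
    print(f"d = {d:.2f}, n per group = {n}, approximate power = {approx_power(d, n):.2f}")
```

With an effect size of zero the function returns α itself, and for a fixed effect size the power grows with the number of samples per group.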

Page 9: Multivariate data analysis and visualization tools for biological data

Univariate: Limitations

•Biological definition of the mean?

•Relationship between sample size and test power

•Multiple hypothesis testing

•False discovery rate
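
One widely used way to control the false discovery rate across many tests is the Benjamini-Hochberg adjustment. The slides do not say which procedure was used, so the sketch below is an assumption, and the p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values from eight hypothetical univariate tests.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
q = benjamini_hochberg(p)
significant = [pi for pi, qi in zip(p, q) if qi <= 0.05]
print(significant)
```

In this example five of the raw p-values fall below 0.05, but only the two smallest remain significant once the false discovery rate is held at 0.05.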

Page 10: Multivariate data analysis and visualization tools for biological data

Old Faithful Data

272 observations

•time between eruptions– 70 ± 14 min

•duration of eruption– 3.5 ± 1 min

Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. Applied Statistics 39, 357–365

Page 11: Multivariate data analysis and visualization tools for biological data

•Matrix of 2 vectors of length m

Bivariate: Properties

Page 12: Multivariate data analysis and visualization tools for biological data

(X,Y)

Bivariate: Representations

Page 13: Multivariate data analysis and visualization tools for biological data

(X,Y)

Bivariate: Utility

Variable 2 = m*Variable 1 + b

•bivariate distribution

•correlation
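
The line "Variable 2 = m*Variable 1 + b" and the correlation can both be computed from the same sums of squares. A minimal sketch, assuming ordinary least squares and the Pearson coefficient; the data values are made up to resemble eruption duration and waiting time and are not the real measurements.

```python
from math import sqrt
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)


# Made-up values loosely shaped like eruption duration (min) and waiting time (min).
duration = [1.8, 2.2, 3.3, 3.6, 4.1, 4.5, 4.7]
waiting = [54, 58, 70, 74, 80, 85, 88]
m, b = fit_line(duration, waiting)
r = pearson_r(duration, waiting)
print(f"waiting = {m:.1f} * duration + {b:.1f}, r = {r:.3f}")
```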

Page 14: Multivariate data analysis and visualization tools for biological data

http://en.wikipedia.org/wiki/Correlation

Bivariate: Limitations

Correlation coefficient

•Measure of linear or monotonic relationship

Page 15: Multivariate data analysis and visualization tools for biological data

http://en.wikipedia.org/wiki/Correlation

Bivariate: Limitations

•Sensitive to outliers
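
Both limitations can be seen on small made-up examples. The sketch below compares the Pearson coefficient (linear relationship) with the Spearman rank coefficient (monotonic relationship), first on a curved but monotonic series and then on a perfect line with one added outlier. The choice of Spearman as the monotonic measure is an assumption; the slides do not name one.

```python
from math import exp, sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient (strength of linear relationship)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))


# 1) Monotonic but non-linear relationship.
x = list(range(1, 9))
y_curved = [exp(v) for v in x]
print("curved:", round(pearson_r(x, y_curved), 3), round(spearman_rho(x, y_curved), 3))

# 2) A perfect line plus a single outlier at (30, 0).
x_out = list(range(1, 11)) + [30]
y_out = list(range(1, 11)) + [0]
print("outlier:", round(pearson_r(x_out, y_out), 3), round(spearman_rho(x_out, y_out), 3))
```

In the second case a single point moves the Pearson coefficient from 1 to roughly -0.15, while the rank-based coefficient drops to 0.5.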

Page 16: Multivariate data analysis and visualization tools for biological data

Old Faithful

Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. Applied Statistics 39, 357–365

Page 17: Multivariate data analysis and visualization tools for biological data

Old Unfaithful?

Page 18: Multivariate data analysis and visualization tools for biological data

Old Unfaithful?

Additional variables

•Nearby hydrofracking

•Improve inference based on more information

Page 20: Multivariate data analysis and visualization tools for biological data

Challenges

•data often wide structured

•integration

•noise

Rewards

•robust inference

•signal amplification

•holistic/systems approach

A matrix of n vectors of length m

Multivariate: Properties

Correlation matrix
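
For a matrix of n variables measured on m samples, the correlation matrix is the n x n table of all pairwise correlations. A minimal sketch with NumPy on simulated data; the sizes and the dependence built into the second variable are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 272, 4  # m observations (rows), n variables (columns); sizes are illustrative
data = rng.normal(size=(m, n))
data[:, 1] = 0.8 * data[:, 0] + 0.2 * data[:, 1]  # make variable 2 track variable 1

# rowvar=False tells NumPy that columns, not rows, are the variables.
corr = np.corrcoef(data, rowvar=False)
print(np.round(corr, 2))
```

The result is symmetric with ones on the diagonal, so only the n(n-1)/2 entries above the diagonal carry information.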

Page 21: Multivariate data analysis and visualization tools for biological data

Principal Components Analysis (PCA)

Linear n-dimensional encoding of original data, where dimensions are:

1. orthogonal (uncorrelated)

2. Top k dimensions are ordered by variance explained

PC 2

PC 1

Multivariate: Dimensional Reduction

Page 22: Multivariate data analysis and visualization tools for biological data

Wall, Michael E., Andreas Rechtsteiner, Luis M. Rocha. "Singular value decomposition and principal component analysis." In A Practical Approach to Microarray Data Analysis, D.P. Berrar, W. Dubitzky, M. Granzow, eds., pp. 91-109. Kluwer: Norwell, MA (2003). LANL LA-UR-02-4001.

Figure: Original Data decomposed into Scores (m x PC), Explained variance (PC x PC), and Loadings (n x PC)

Calculating PCs: singular value decomposition (SVD)

Eigenvalue

•explained variance

Scores

•sample representation based on all variables

Loadings

•variable contribution to scores

Multivariate: Dimensional Reduction
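
The relationship between SVD, scores, loadings, and eigenvalues can be sketched in a few lines of NumPy. This is a minimal illustration on simulated data, assuming rows are samples and columns are variables; the function name `pca_svd` and the autoscaling default are choices made here, not taken from the slides.

```python
import numpy as np


def pca_svd(X, scale=True):
    """PCA of an m x n matrix (rows = samples, columns = variables) via SVD."""
    Xc = X - X.mean(axis=0)
    if scale:
        Xc = Xc / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                            # m x PC: sample coordinates on each PC
    loadings = Vt.T                           # n x PC: variable contributions to each PC
    eigenvalues = s ** 2 / (X.shape[0] - 1)   # variance captured by each PC
    explained = eigenvalues / eigenvalues.sum()
    return Xc, scores, loadings, explained


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X[:, 1] += 2.0 * X[:, 0]  # build in one strong correlation so PC 1 has something to find
Xc, scores, loadings, explained = pca_svd(X)
print("fraction of variance explained per PC:", np.round(explained, 3))
```

The loadings are orthonormal, the score columns are mutually uncorrelated, and multiplying scores by the transposed loadings reconstructs the pre-treated data.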

Page 23: Multivariate data analysis and visualization tools for biological data

Old Faithful 2.0

•272 measurements

•8 variables

•2 real, 6 random noise

A matrix of n vectors of length m

Multivariate: Representations

Page 24: Multivariate data analysis and visualization tools for biological data

Multivariate: Representation

•Identify outliers using all measurements

•Use known values to impute missing ones

•Identify interesting groups

•Evaluate uni- and bivariate observations

•Number of PCs can be used to estimate true data complexity

Page 25: Multivariate data analysis and visualization tools for biological data

PCA: Considerations

•data pre-treatment

•outliers

•noise

•unsupervised projection

no pre-treatment

centered and scaled to unit variance
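
The effect of pre-treatment can be sketched with two related variables on very different scales. The data below are simulated stand-ins, not the real measurements, with means and spreads chosen to resemble waiting time and eruption duration; the helper name `first_pc` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 272
latent = rng.normal(size=m)
# Simulated stand-ins: two correlated variables with very different variances.
waiting = 70 + 14 * (0.9 * latent + 0.44 * rng.normal(size=m))
duration = 3.5 + 1.0 * (0.9 * latent + 0.44 * rng.normal(size=m))
X = np.column_stack([waiting, duration])


def first_pc(X, autoscale):
    """Loadings and explained-variance fraction of PC 1, with or without unit-variance scaling."""
    Xc = X - X.mean(axis=0)
    if autoscale:
        Xc = Xc / Xc.std(axis=0, ddof=1)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    return Vt[0], explained[0]


raw_loading, raw_explained = first_pc(X, autoscale=False)
scaled_loading, scaled_explained = first_pc(X, autoscale=True)
print("centered only:       ", np.round(raw_loading, 3), round(float(raw_explained), 3))
print("centered and scaled: ", np.round(scaled_loading, 3), round(float(scaled_explained), 3))
```

Without scaling, PC 1 is almost entirely the high-variance variable; after scaling to unit variance both variables contribute equally.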

Page 26: Multivariate data analysis and visualization tools for biological data

PCA: Considerations

•data pre-treatment

•outliers

•linear reconstruction

•noise

•Independent components analysis (ICA)

•unsupervised projection

Use ICA to calculate statistically independent components

Page 27: Multivariate data analysis and visualization tools for biological data

PCA: Considerations

•data pre-treatment

•outliers

•linear reconstruction

•noise

•supervised projection

•Non-negative matrix factorization (NMF)

NMF uses additive parts based encoding

Learning the parts of objects by nonnegative matrix factorization, D.D. Lee, H.S. Seung, Zhipeng Zhao, ppt.
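
The additive, parts-based encoding can be sketched with the multiplicative update rules commonly attributed to Lee and Seung for the squared-error objective. This is a minimal NumPy sketch on simulated non-negative data; the function name, iteration count, and the small constant `eps` that guards against division by zero are assumptions made here.

```python
import numpy as np


def nmf(V, k, n_iter=200, seed=0, eps=1e-9):
    """Factor non-negative V (m x n) into W (m x k) @ H (k x n), both non-negative."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(float(np.linalg.norm(V - W @ H)))
    return W, H, errors


rng = np.random.default_rng(5)
V = rng.random((20, 3)) @ rng.random((3, 8))  # simulated non-negative data built from 3 parts
W, H, errors = nmf(V, k=3)
print(f"reconstruction error: {errors[0]:.3f} -> {errors[-1]:.3f}")
```

Because neither factor can contain negative entries, parts can only be added together, never cancelled against each other, which is the contrast with PCA loadings.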

Page 28: Multivariate data analysis and visualization tools for biological data

PCA: Considerations

•data pre-treatment

•outliers

•linear reconstruction

•noise

•supervised projection

•Identify projection correlated with class assignment (classification) or continuous variables (regression)

•Partial Least Squares Projection to Latent Structures (PLS/-DA)

Page 29: Multivariate data analysis and visualization tools for biological data

PLS/-DA: Utility

Strengths

•Predict multiple dependent variables

•avoids issues of multicollinearity

•Independent measure of variable importance

Weaknesses

•Need to derive an empirical reference for model performance

•Poorly established model optimization methods

Page 30: Multivariate data analysis and visualization tools for biological data

PLS-DA: Example

•Data: Old Faithful 2.0

•272 observations on 8 variables

•Latent Variables are analogous to PCs

•Important Statistics (CV)

•Q2 = fit

•RMSEP = error of prediction

•AU(RO)C = specificity vs. sensitivity

Select the appropriate number of latent variables (LVs) to maximize Q2
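
Choosing the number of latent variables by cross-validated Q2 can be sketched as follows. This is a minimal single-response PLS (NIPALS PLS1) with a 0/1 class label as the response, run on simulated data laid out like the example (272 observations, 8 variables, 2 informative). The simulated data, the function names, the 7-fold split, and the definition Q2 = 1 - PRESS/TSS are assumptions made here, not the author's implementation.

```python
import numpy as np


def pls1_fit(X, y, n_lv):
    """NIPALS PLS1 on centered X and y; returns regression coefficients."""
    Xr, yr = X.copy(), y.copy()
    n_vars = X.shape[1]
    W = np.zeros((n_vars, n_lv))
    P = np.zeros((n_vars, n_lv))
    q = np.zeros(n_lv)
    for a in range(n_lv):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)          # weight vector: direction of maximal covariance with y
        t = Xr @ w                      # latent variable scores
        tt = t @ t
        P[:, a] = Xr.T @ t / tt         # X loadings
        q[a] = yr @ t / tt              # y loading
        W[:, a] = w
        Xr = Xr - np.outer(t, P[:, a])  # deflate X and y before the next latent variable
        yr = yr - q[a] * t
    return W @ np.linalg.solve(P.T @ W, q)


def q2(X, y, n_lv, n_folds=7, seed=0):
    """Cross-validated Q2 = 1 - PRESS / total sum of squares."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    press = 0.0
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        x_mean, y_mean = X[train].mean(axis=0), y[train].mean()
        coef = pls1_fit(X[train] - x_mean, y[train] - y_mean, n_lv)
        pred = (X[test] - x_mean) @ coef + y_mean
        press += np.sum((y[test] - pred) ** 2)
    return 1 - press / np.sum((y - y.mean()) ** 2)


# Simulated stand-in data: 2 informative variables, 6 pure-noise variables.
rng = np.random.default_rng(0)
n_obs = 272
y = (rng.random(n_obs) < 0.5).astype(float)
X = rng.normal(size=(n_obs, 8))
X[:, :2] += 2.0 * y[:, None]

q2_by_lv = {n_lv: q2(X, y, n_lv) for n_lv in range(1, 5)}
best_lv = max(q2_by_lv, key=q2_by_lv.get)
print({k: round(float(v), 3) for k, v in q2_by_lv.items()}, "best:", best_lv)
```

Adding latent variables beyond what the data support tends to fit noise, which shows up as a Q2 that stops improving or falls.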

Page 31: Multivariate data analysis and visualization tools for biological data

PLS-DA: Performance

•Use permutation tests to empirically determine model performance
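
A permutation test builds an empirical reference by recomputing the performance statistic after shuffling the class labels many times. The sketch below uses the distance between class centroids as a simple stand-in for a PLS-DA statistic such as Q2; the stand-in score, the simulated data, and the function names are assumptions for illustration.

```python
import numpy as np


def permutation_p_value(score_fn, X, y, n_perm=500, seed=0):
    """Empirical p-value: share of label permutations scoring at least as well as the real labels."""
    rng = np.random.default_rng(seed)
    observed = score_fn(X, y)
    null = np.array([score_fn(X, rng.permutation(y)) for _ in range(n_perm)])
    p = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, null, float(p)


def centroid_separation(X, y):
    """Stand-in performance score: distance between the two class centroids."""
    return float(np.linalg.norm(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)))


# Simulated data: 60 samples, 8 variables, of which the first 2 differ between classes.
rng = np.random.default_rng(3)
y = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 8))
X[:, :2] += 1.5 * y[:, None]

observed, null, p = permutation_p_value(centroid_separation, X, y)
print(f"observed = {observed:.2f}, permuted mean = {null.mean():.2f}, p = {p:.4f}")
```

Any model statistic can be passed as `score_fn`, as long as it is recomputed from scratch for every permuted label vector.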

Page 33: Multivariate data analysis and visualization tools for biological data

PLS: Predictive Performance

•Split data into training (2/3) and test sets (1/3)

•Generate model using training set and then predict class assignment for test set

•Use permutation tests to generate confidence bounds for future predictions
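
The training and test split can be sketched as below. A nearest-centroid classifier stands in for the PLS-DA model so the example stays short; the classifier, the simulated data, and the variable names are assumptions, not the method used in the slides.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 90
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, 8))
X[:, :2] += 2.0 * y[:, None]  # simulated data: only the first 2 variables carry class signal

# Random split: 2/3 of the samples for training, 1/3 held out for testing.
idx = rng.permutation(n)
n_train = (2 * n) // 3
train, test = idx[:n_train], idx[n_train:]

# Fit on the training set only (stand-in model: one centroid per class).
centroids = np.array([X[train][y[train] == c].mean(axis=0) for c in (0, 1)])

# Predict the held-out samples by assigning each to its nearest centroid.
dist = np.linalg.norm(X[test][:, None, :] - centroids[None, :, :], axis=2)
pred = dist.argmin(axis=1)
accuracy = float(np.mean(pred == y[test]))
print(f"test-set accuracy: {accuracy:.2f}")
```

Nothing computed from the test samples is used during fitting, so the test accuracy estimates how the model would perform on new observations.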

Page 34: Multivariate data analysis and visualization tools for biological data

PLS: Predictive Performance

Page 35: Multivariate data analysis and visualization tools for biological data

PLS: Feature Selection

Use the PLS-DA as an objective function to identify the most informative variables

Page 36: Multivariate data analysis and visualization tools for biological data

Networks

Network: representation of relationships among objects

Utility

•Project statistical results into a biological context

•Explore informative data aspects in the context of all that was observed.

•Identify emergent patterns
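
One simple way to turn multivariate data into a network is to connect variables whose pairwise correlation is strong. The sketch below is an assumption about how such a network could be built, not a description of the networks shown in the slides; the threshold, the simulated data, and the variable names A to D are hypothetical.

```python
import numpy as np


def correlation_network(data, names, threshold=0.6):
    """Edges (name_i, name_j, r) for variable pairs with |Pearson r| >= threshold."""
    corr = np.corrcoef(data, rowvar=False)
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                edges.append((names[i], names[j], round(float(corr[i, j]), 2)))
    return edges


# Simulated data: A, B, and C share one underlying factor; D is independent.
rng = np.random.default_rng(4)
f = rng.normal(size=200)
data = np.column_stack([
    f + 0.3 * rng.normal(size=200),
    f + 0.3 * rng.normal(size=200),
    -f + 0.3 * rng.normal(size=200),
    rng.normal(size=200),
])
names = ["A", "B", "C", "D"]  # hypothetical variable names
edges = correlation_network(data, names)
for a, b, r in edges:
    print(f"{a} -- {b}  (r = {r})")
```

The sign of each edge can be mapped to color and its magnitude to line width when the network is drawn.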

Page 37: Multivariate data analysis and visualization tools for biological data

Networks

•Interpret statistical results within a biological context

Page 38: Multivariate data analysis and visualization tools for biological data

Networks

•Highlight changes in patterns of relationships.

non-diabetics type 2 diabetics

Page 39: Multivariate data analysis and visualization tools for biological data

Networks

•Display complex interactions

non-diabetics type 2 diabetics

Figure: T2D and UCP3 OPLS-DA Loadings (x-axis: increasing in T2D; y-axis: increasing in g/a UCP3). Labeled variables: 1-LG, 12(13)-EpODE, 12,13-DiHOME, 15(S)-HEPE, 17,18-DiHETE, 18:1n9, 22:5n6, 5-HETE, 9-HETE, 9(10)-EpOME, 9,10-DiHOME, 9,12,13-TriHOME, AEA, c16:0, c18:0, DHEA, LEA, NO-Gly, SEA. Groups: g/a, g/g, T2D, non-T2D.

Page 40: Multivariate data analysis and visualization tools for biological data

non-diabetics type 2 diabetics

imDEV: interactive modules for Data Exploration and Visualization

 An integrated environment for systems level analysis of multivariate data.

http://sourceforge.net/apps/mediawiki/imdev

Page 41: Multivariate data analysis and visualization tools for biological data

Acknowledgements

Newman Lab

Designated Emphasis in Biotechnology (DEB)

NIH

This project is funded in part by the NIH grant NIGMS-NIH T32-GM008799, USDA-ARS 5306-51530-019-00D, and NIH-NIDDK R01DK078328-01.