High-Dimensional Biological Data Analysis and Visualization

Dmitry Grapov, PhD

Metabolomic Data Analysis for the Study of Diseases

State of the art facility producing massive amounts of biological data…

• >13,000 samples/yr
• >160 studies
• ~32,000 data points/study

Goals?

Analysis at the Metabolomic Scale

Univariate vs. Multivariate

Univariate: Group 1 vs. Group 2
• Hypothesis testing (t-Test, ANOVA, etc.)

Multivariate, Predictive Modeling
• PCA, O-/PLS/-DA

univariate/bivariate vs. multivariate

mixed up samples? outliers?

Univariate vs. Multivariate

Data Complexity

[Figure: data complexity, from 1-D and 2-D to m-D]

Data
• samples (n)
• variables (m)

Meta Data
• Experimental Design =

Variable # = dimensionality

Statistical Analysis
• Identify differences in sample population means
• sensitive to distribution shape
• parametric = assumes normality
• error in Y, not in X (Y = mX + error)
• optimal for long data
• assumed independence
• false discovery rate (FDR)

[Figure: data shapes: long, wide, n-of-one]

Achieving “significance” is a function of:
• significance level (α) and power (1-β)
• effect size (standardized difference in means)
• sample size (n)
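How significance level, power, effect size and sample size trade off can be sketched with the normal approximation for a two-sided, two-sample comparison of means. The function name and the default α = 0.05 and power = 0.8 are illustrative assumptions, not values from the slides, and an exact t-based calculation gives slightly larger n.

```python
import math
from statistics import NormalDist

def samples_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate n per group (normal approximation, two-sided test).

    effect_size is the standardized difference in means.
    alpha and power defaults are illustrative assumptions.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for power (1 - beta)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# Smaller effects need many more samples to reach the same power.
for d in (0.2, 0.5, 0.8):
    print(d, samples_per_group(d))
```

Halving the effect size roughly quadruples the required sample size, which is why small effects are hard to detect in wide data sets with few samples.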

• Type I Error: False Positives
• Type II Error: False Negatives
• Type I risk = 1 - (1 - p.value)^m, where m = number of variables tested
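The Type I risk formula, 1 - (1 - p.value)^m, can be evaluated directly to see how quickly false positives accumulate. This is a minimal sketch that assumes the m tests are independent; the function name and the 0.05 threshold are illustrative.

```python
def family_wise_error(alpha, m):
    """Probability of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m

# Risk of at least one false positive at a per-test threshold of 0.05.
for m in (1, 10, 100, 148):
    print(m, round(family_wise_error(0.05, m), 3))
```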

FDR correction

• p-value adjustment or estimate of FDR (Fdr, q-value)

False Discovery Rate (FDR)

Bioinformatics (2008) 24 (12):1461-1462

FDR correction

[Figure: FDR adjusted p-value vs. p-value]

Benjamini & Hochberg (1995) (“BH”)
• Accepted standard

Bonferroni
• Very conservative
• adjusted p-value = p-value * # of tests (e.g. 0.005 * 148 = 0.74)
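Both adjustments can be sketched in a few lines. The function names and the toy p-values are illustrative; the BH version follows the usual step-up formulation (sorted p-value times m divided by its rank, made monotone from the largest p-value down).

```python
def bonferroni(pvalues):
    """Adjusted p-value = p-value * number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """BH adjusted p-values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.005]  # toy values, not from any study
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

On the same input, the BH adjusted values are never larger than the Bonferroni ones, which reflects how conservative Bonferroni is.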

Multivariate Analysis

Clustering
• Grouping based on similarity/dissimilarity

Principal Components Analysis (PCA)
• Identify modes of variance in the data

Partial Least Squares (PLS)
• Identify modes of variance in the data correlated with a hypothesis

Cluster Analysis

Use similarity/dissimilarity to group a collection of samples or variables

Approaches
• hierarchical (HCA)
• non-hierarchical (k-NN, k-means)
• distribution (mixture models)
• density (DBSCAN)
• self-organizing maps (SOM)

[Figure panels: Linkage, k-means, Distribution, Density]

Hierarchical Cluster Analysis
• similarity/dissimilarity defines “nearness” or distance
• objects are grouped based on linkage methods
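The two ingredients, a distance and a linkage method, can be made concrete with a small agglomerative sketch. Euclidean distance and single linkage are chosen here only for illustration (the slides do not name a specific distance or linkage), and the sample coordinates are made up.

```python
import math

def euclidean(a, b):
    """Distance defines "nearness" between two objects."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(points, n_clusters):
    """Agglomerative clustering: repeatedly merge the two nearest clusters."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per object
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: cluster distance = closest pair of members
                d = min(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Toy samples (hypothetical): three near the origin, two near (5, 5).
samples = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9)]
print(single_linkage(samples, 2))
```

Recording the merge distance at each step, rather than stopping at a fixed number of clusters, would give the full hierarchy that a dendrogram displays.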

Hierarchy of Similarity

[Figure; axis: Similarity]

How does my metadata match my data structure?

Hierarchy of effect sizes

Projection of Data

The algorithm defines the position of the light source

Principal Components Analysis (PCA)
• unsupervised
• maximize variance (X)

Partial Least Squares Projection to Latent Structures (PLS)
• supervised
• maximize covariance (Y ~ X)
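"Maximize variance" can be illustrated by finding the first principal component with power iteration on the covariance matrix. This is a minimal sketch, not a production PCA: it assumes the starting vector is not orthogonal to the first component and that the data are not constant, and the toy data are made up. Scores are the centered data projected onto the loadings.

```python
import math

def first_principal_component(rows, n_iter=200):
    """Return (loadings, scores, variance) for PC1 of mean-centered data."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    X = [[r[j] - means[j] for j in range(m)] for r in rows]  # centering
    # sample covariance matrix (m x m)
    cov = [[sum(x[a] * x[b] for x in X) / (n - 1) for b in range(m)]
           for a in range(m)]
    v = [1.0] * m  # assumed not orthogonal to PC1
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    scores = [sum(x[j] * v[j] for j in range(m)) for x in X]  # row (sample) scores
    variance = sum(s * s for s in scores) / (n - 1)           # eigenvalue of PC1
    return v, scores, variance

# Toy data (hypothetical): two strongly correlated variables.
data = [(2.0, 1.9), (4.0, 4.2), (6.0, 5.8), (8.0, 8.1)]
loadings, scores, variance = first_principal_component(data)
print(loadings, variance)
```

PLS replaces the covariance of X with itself by the covariance between X and Y, so the direction found is the one most related to the hypothesis rather than the one with the most variance.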

James X. Li, 2009, VisuMap Tech.

PC1, PC2

http://www.scholarpedia.org/article/Eigenfaces

Raw data PCA dimensions

Interpreting PCA Results

Variance explained (eigenvalues)

Row (sample) scores and column (variable) loadings

How are scores and loadings related?

Centering and Scaling

PMID: 16762068
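Centering and scaling decide how much each variable can contribute to a variance-based model. Two commonly used options are sketched below; the function name, the method labels, and the toy values are illustrative assumptions, and a constant column (standard deviation of zero) is not handled.

```python
from statistics import mean, stdev

def scale_columns(rows, method="auto"):
    """Mean-center each variable (column), then scale it.

    method="auto":   divide by the standard deviation (unit variance)
    method="pareto": divide by the square root of the standard deviation
    method="center": centering only
    """
    scaled = []
    for col in zip(*rows):
        mu, sd = mean(col), stdev(col)
        if method == "auto":
            divisor = sd
        elif method == "pareto":
            divisor = sd ** 0.5
        elif method == "center":
            divisor = 1.0
        else:
            raise ValueError(f"unknown method: {method}")
        scaled.append([(v - mu) / divisor for v in col])
    return [list(row) for row in zip(*scaled)]

# Toy data (hypothetical): the second variable is on a much larger scale.
raw = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
print(scale_columns(raw, "auto"))
print(scale_columns(raw, "pareto"))
```

Without scaling, the second variable would dominate the first component simply because of its units; autoscaling gives both variables equal weight, while Pareto scaling keeps some of the original magnitude difference.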

Use PLS to test a hypothesis

time = 0, 120 min.

Partial Least Squares (PLS) is used to identify planes of maximum correlation between X measurements and Y (hypothesis)

PCA PLS

PLS model validation is critical

Determine in-sample (Q2) and out-of-sample error (RMSEP) and compare to a random model

• permutation tests

• training/testing
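The comparison to a random model can be sketched as a permutation test combined with a training/testing split. To keep the example self-contained, a single-predictor least-squares line stands in for the PLS model; the function names, the toy data, and the number of permutations are illustrative assumptions.

```python
import math
import random

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def rmsep(model, x, y):
    """Root mean squared error of prediction on held-out samples."""
    slope, intercept = model
    return math.sqrt(
        sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y)) / len(x)
    )

def permutation_p_value(x_train, y_train, x_test, y_test, n_perm=999, seed=0):
    """Fraction of permuted-Y models that predict at least as well as the real one."""
    rng = random.Random(seed)
    observed = rmsep(fit_line(x_train, y_train), x_test, y_test)
    as_good = 0
    for _ in range(n_perm):
        shuffled = list(y_train)
        rng.shuffle(shuffled)  # break the X-Y relationship
        if rmsep(fit_line(x_train, shuffled), x_test, y_test) <= observed:
            as_good += 1
    return observed, (as_good + 1) / (n_perm + 1)

# Toy data (hypothetical): a clear linear trend with a small amount of noise.
x_train = [0, 1, 2, 3, 4, 5, 6, 7]
y_train = [0.1, 1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1]
x_test, y_test = [1.5, 4.5, 6.5], [1.4, 4.6, 6.4]
observed, p_value = permutation_p_value(x_train, y_train, x_test, y_test)
print(observed, p_value)
```

A model whose error is no better than that of most permuted models has not learned a relationship between X and Y, however good its in-sample fit looks.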

Databases for organism-specific biochemical information:

Multiple organisms

• KEGG

• BioCyc

• Reactome

Human

• HMDB

• SMPDB

Biochemical domain information

Pathway Enrichment Analysis

http://www.metaboanalyst.ca/MetaboAnalyst/faces/UploadView.jsp

enrichment vs. topological importance
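One common way to score enrichment is over-representation analysis with a hypergeometric test: given a background of measured metabolites, how surprising is it that so many of the altered ones fall in a single pathway? The sketch below is an illustration of that test only; the slides do not state which method or settings the linked tool uses, and all names and counts are hypothetical.

```python
from math import comb

def enrichment_p_value(hits, selected, pathway_size, background):
    """P(X >= hits) when `selected` metabolites are drawn from `background`,
    of which `pathway_size` belong to the pathway (hypergeometric tail)."""
    upper = min(selected, pathway_size)
    total = comb(background, selected)
    favorable = sum(
        comb(pathway_size, k) * comb(background - pathway_size, selected - k)
        for k in range(hits, upper + 1)
    )
    return favorable / total

# Hypothetical counts: 100 measured metabolites, 10 in the pathway,
# 20 altered metabolites, 6 of which are pathway members.
print(enrichment_p_value(hits=6, selected=20, pathway_size=10, background=100))
```

Because many pathways are tested at once, the resulting p-values are themselves candidates for FDR adjustment.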


Network Mapping
• Biochemical
• Structural Similarity

doi:10.1186/1471-2105-13-99
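Structural similarity edges can be sketched with the Tanimoto coefficient on substructure fingerprints, a widely used similarity measure for chemical structures. Everything specific here is an assumption for illustration: the fingerprints are made-up sets of feature indices, the metabolite names are placeholders, and the threshold is arbitrary.

```python
def tanimoto(a, b):
    """Shared features divided by total distinct features (0 to 1)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity_edges(fingerprints, threshold):
    """Connect every pair of metabolites whose similarity meets the threshold."""
    names = sorted(fingerprints)
    edges = []
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            score = tanimoto(fingerprints[u], fingerprints[v])
            if score >= threshold:
                edges.append((u, v, score))
    return edges

# Hypothetical fingerprints: sets of substructure feature indices.
toy = {"A": {1, 2, 3, 4}, "B": {1, 2, 3, 5}, "C": {7, 8}}
print(similarity_edges(toy, threshold=0.5))
```

Similarity edges of this kind can link metabolites that have no annotated biochemical reaction between them, so the network still covers compounds missing from pathway databases.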

Data visualization as a form of analysis

Dextromethorphan (DM) → dextrorphan (Liver CYP2D6)

= additives in
• high fructose corn syrup
• antioxidants
• flavor

Identification of relationships between altered metabolites
• urea cycle
• nucleotide synthesis
• protein glycosylation

Identification of treatment effects

Analysis of differential metabolic responses

Treatment 1 Treatment 2

Resources
• DeviumWeb: Dynamic multivariate data analysis and visualization platform
  url: https://github.com/dgrapov/DeviumWeb
• imDEV: Microsoft Excel add-in for multivariate analysis
  url: http://sourceforge.net/projects/imdev/
• MetaMapR: Network analysis tools for metabolomics
  url: https://github.com/dgrapov/MetaMapR
• TeachingDemos: Tutorials and demonstrations
  url: http://sourceforge.net/projects/teachingdemos/?source=directory
  url: https://github.com/dgrapov/TeachingDemos
• CDS Blog: Data analysis case studies
  url: http://imdevsoftware.wordpress.com/


[email protected] metabolomics.ucdavis.edu

This research was supported in part by NIH 1 U24 DK097154
