high dimensional biological data analysis and visualization

Dmitry Grapov, PhD Metabolomic Data Analysis for the Study of Diseases


Post on 10-May-2015


DESCRIPTION

Examples of data analysis and visualization of high dimensional metabolomic data.

TRANSCRIPT

Page 1: High Dimensional Biological Data Analysis and Visualization

Dmitry Grapov, PhD

Metabolomic Data Analysis for the Study of Diseases

Page 2: High Dimensional Biological Data Analysis and Visualization

State of the art facility producing massive amounts of biological data…

• >13,000 samples/yr
• >160 studies
• ~32,000 data points/study

Page 3: High Dimensional Biological Data Analysis and Visualization

Goals?

Page 4: High Dimensional Biological Data Analysis and Visualization

Analysis at the Metabolomic Scale

Page 5: High Dimensional Biological Data Analysis and Visualization

Univariate vs. Multivariate

• Univariate: Hypothesis testing (t-Test, ANOVA, etc.)
• Multivariate: PCA
• Predictive Modeling: O-/PLS/-DA

[Figure: Group 1 vs. Group 2]

Page 6: High Dimensional Biological Data Analysis and Visualization

Univariate vs. Multivariate

univariate/bivariate vs. multivariate

mixed up samples? outliers?

Page 7: High Dimensional Biological Data Analysis and Visualization

Data Complexity

Data: samples (n) by variables (m)

complexity: 1-D, 2-D, m-D

Meta Data = Experimental Design

Variable # = dimensionality

Page 8: High Dimensional Biological Data Analysis and Visualization

Statistical Analysis

• Identify differences in sample population means
• sensitive to distribution shape
• parametric = assumes normality
• error in Y, not in X (Y = mX + error)
• optimal for long data
• assumed independence
• false discovery rate (FDR)

[Figure: data shapes: long, wide, n-of-one]

Page 9: High Dimensional Biological Data Analysis and Visualization

Achieving “significance” is a function of:

significance level (α) and power (1-β)

effect size (standardized difference in means)

sample size (n)
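To make the relationship between these three quantities concrete, here is a minimal sketch that estimates the sample size per group for a two-sided comparison of two group means. It assumes the standard normal approximation n = 2 * ((z_(1-α/2) + z_(1-β)) / d)^2, which is not stated on the slide; the function name `samples_per_group` and the default values are illustrative.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    Assumption: normal approximation, equal group sizes and variances.
    effect_size is the standardized difference in means (d).
    """
    z = NormalDist()                      # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)    # critical value for the significance level
    z_power = z.inv_cdf(power)            # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Smaller effects need many more samples at the same alpha and power.
for d in (0.2, 0.5, 0.8):
    print(f"effect size {d}: about {samples_per_group(d)} samples per group")
```

Because the approximation ignores the t-distribution correction, exact power software will typically report slightly larger sample sizes for small n.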

Page 10: High Dimensional Biological Data Analysis and Visualization

False Discovery Rate (FDR)

• Type I Error: False Positives
• Type II Error: False Negatives
• Type I risk = 1 - (1 - p.value)^m, where m = number of variables tested

FDR correction
• p-value adjustment or estimate of FDR (Fdr, q-value)

Bioinformatics (2008) 24 (12):1461-1462
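The Type I risk formula can be evaluated directly to see how quickly false positives accumulate. This is a small sketch that assumes the m tests are independent and treats the slide's "p.value" as the per-test significance threshold; the function name `type_one_risk` is illustrative.

```python
def type_one_risk(threshold, m):
    """Probability of at least one false positive across m independent tests."""
    return 1 - (1 - threshold) ** m


# Risk at a per-test threshold of 0.05 for an increasing number of variables.
for m in (1, 10, 50, 148):
    print(f"m = {m:>3}: risk = {type_one_risk(0.05, m):.4f}")
```

With a few hundred variables, which is common in metabolomic data sets, the chance of at least one false positive at an unadjusted 0.05 threshold is close to certain.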

Page 11: High Dimensional Biological Data Analysis and Visualization

FDR correction

[Figure: FDR adjusted p-value vs. p-value]

Benjamini & Hochberg (1995) (“BH”)
• Accepted standard

Bonferroni
• Very conservative
• adjusted p-value = p-value * # of tests (e.g. 0.005 * 148 = 0.74)
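Both adjustments are short enough to write out by hand. The sketch below implements the Bonferroni and Benjamini-Hochberg step-up adjustments in plain Python; the function names and the example p-values are made up for illustration and are not from the slides.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg (BH) adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]  # hypothetical
for p, bon, bh in zip(pvals, bonferroni(pvals), benjamini_hochberg(pvals)):
    print(f"p = {p:<6} Bonferroni = {bon:.3f}  BH = {bh:.4f}")
```

In this example the BH-adjusted values are never larger than the Bonferroni ones, which is why BH retains more power when many variables are tested.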

Page 12: High Dimensional Biological Data Analysis and Visualization

Multivariate Analysis

Clustering
• Grouping based on similarity/dissimilarity

Principal Components Analysis (PCA)
• Identify modes of variance in the data

Partial Least Squares (PLS)
• Identify modes of variance in the data correlated with a hypothesis

Page 13: High Dimensional Biological Data Analysis and Visualization

Cluster Analysis

Use similarity/dissimilarity to group a collection of samples or variables

Approaches
• hierarchical (HCA)
• non-hierarchical (k-NN, k-means)
• distribution (mixture models)
• density (DBSCAN)
• self organizing maps (SOM)

[Figure panels: Linkage, k-means, Distribution, Density]

Page 14: High Dimensional Biological Data Analysis and Visualization

Hierarchical Cluster Analysis

• similarity/dissimilarity defines “nearness” or distance
• objects are grouped based on linkage methods
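The two ideas above, a distance measure plus a linkage rule, can be sketched as a small agglomerative procedure. This example assumes Euclidean distance and single linkage, which the slide does not specify; the function name `single_linkage` and the sample coordinates are hypothetical.

```python
from math import dist  # Euclidean distance between two points


def single_linkage(points, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain.

    Single linkage: the distance between two clusters is the distance
    between their closest pair of members.
    """
    clusters = [[i] for i in range(len(points))]  # start with one cluster per sample
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical samples measured on two variables.
samples = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0), (5.0, 5.2), (5.1, 4.9), (9.0, 1.0)]
print(single_linkage(samples, 3))
```

Swapping `min` for `max` in the cluster distance gives complete linkage, and recording the merge distances at each step gives the heights of a dendrogram.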

Page 15: High Dimensional Biological Data Analysis and Visualization

Hierarchy of Similarity

[Figure: hierarchy with a Similarity axis]

How does my metadata match my data structure?

Hierarchy of effect sizes

Page 16: High Dimensional Biological Data Analysis and Visualization

Projection of Data

The algorithm defines the position of the light source

Principal Components Analysis (PCA)
• unsupervised
• maximize variance (X)

Partial Least Squares Projection to Latent Structures (PLS)
• supervised
• maximize covariance (Y ~ X)

[Figure: projection onto PC1 and PC2; James X. Li, 2009, VisuMap Tech.]

[Figure: Raw data vs. PCA dimensions; http://www.scholarpedia.org/article/Eigenfaces]

Page 17: High Dimensional Biological Data Analysis and Visualization

Interpreting PCA Results

Variance explained (eigenvalues)

Row (sample) scores and column (variable) loadings
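As a rough illustration of these three outputs, the sketch below computes the first principal component of a small data set with power iteration on the covariance matrix. The method choice, the function name `pca_first_component`, and the toy data are assumptions made for illustration; a real analysis would normally use an SVD routine from a numerical library.

```python
from math import sqrt


def pca_first_component(rows, n_iter=200):
    """Return (loadings, scores, fraction of variance explained) for PC1."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    X = [[r[j] - means[j] for j in range(m)] for r in rows]  # mean-centered data
    # Sample covariance matrix between variables (m x m).
    C = [[sum(x[a] * x[b] for x in X) / (n - 1) for b in range(m)]
         for a in range(m)]
    # Power iteration converges to the eigenvector with the largest eigenvalue
    # (assumes the start vector is not orthogonal to that eigenvector).
    v = [1.0] * m
    for _ in range(n_iter):
        w = [sum(C[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    eigenvalue = sum(v[a] * sum(C[a][b] * v[b] for b in range(m))
                     for a in range(m))
    total_variance = sum(C[a][a] for a in range(m))
    scores = [sum(x[j] * v[j] for j in range(m)) for x in X]  # one per sample
    return v, scores, eigenvalue / total_variance


# Toy data: 10 samples (rows) by 2 correlated variables (columns).
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
loadings, scores, explained = pca_first_component(data)
print("loadings (one per variable):", [round(x, 3) for x in loadings])
print("scores (one per sample):", [round(s, 2) for s in scores])
print("fraction of variance explained by PC1:", round(explained, 3))
```

The loadings describe how each variable contributes to the component, the scores place each sample along it, and the eigenvalue divided by the total variance gives the variance explained.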

Page 18: High Dimensional Biological Data Analysis and Visualization

How are scores and loadings related?

Page 19: High Dimensional Biological Data Analysis and Visualization

Centering and Scaling

PMID: 16762068
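Centering and scaling are easy to show on a small table. This is a minimal sketch of mean centering followed by two commonly used scaling options, autoscaling (unit variance) and Pareto scaling; the function name `scale_columns`, the method labels, and the intensity values are illustrative assumptions rather than content from the slide.

```python
from statistics import mean, stdev


def scale_columns(rows, method="auto"):
    """Mean-center each column (variable), then scale it.

    method="center": centering only
    method="auto":   divide by the standard deviation (unit variance)
    method="pareto": divide by the square root of the standard deviation
    Assumes no column is constant (standard deviation > 0).
    """
    scaled_columns = []
    for col in zip(*rows):  # iterate over variables
        mu, sd = mean(col), stdev(col)
        if method == "center":
            divisor = 1.0
        elif method == "auto":
            divisor = sd
        elif method == "pareto":
            divisor = sd ** 0.5
        else:
            raise ValueError(f"unknown method: {method}")
        scaled_columns.append([(x - mu) / divisor for x in col])
    return [list(row) for row in zip(*scaled_columns)]  # back to samples x variables


# Hypothetical intensities: one abundant and one low-abundance metabolite.
raw = [(12000.0, 1.2), (15000.0, 0.8), (9000.0, 1.9), (13500.0, 1.1)]
for method in ("center", "auto", "pareto"):
    print(method, [[round(v, 2) for v in row] for row in scale_columns(raw, method)])
```

Without scaling, the abundant metabolite would dominate any variance-based method such as PCA simply because of its magnitude.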

Page 20: High Dimensional Biological Data Analysis and Visualization

Use PLS to test a hypothesis

Partial Least Squares (PLS) is used to identify planes of maximum correlation between X measurements and Y (hypothesis)

[Figure panels: PCA, PLS; time = 0 and 120 min.]

Page 21: High Dimensional Biological Data Analysis and Visualization

PLS model validation is critical

Determine in-sample (Q2) and out-of-sample error (RMSEP) and compare to a random model

• permutation tests

• training/testing
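The logic of a permutation test can be sketched without a full PLS implementation. In the example below, the squared correlation between one variable and Y stands in for the model fit statistic; this substitution, the function names, and the data are assumptions for illustration only. With a real PLS model, the same loop would refit the model on each permuted Y and compare Q2 or RMSEP.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation, used here as a stand-in fit statistic."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def permutation_p_value(x, y, n_perm=999, seed=0):
    """Compare the observed fit to fits obtained after shuffling Y."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = r_squared(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break any real link between X and Y
        if r_squared(x, y_perm) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical measurements for two groups coded 0 and 1.
x = [0.1, 0.4, 0.5, 0.9, 1.2, 1.4, 1.9, 2.0, 2.3, 2.8]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
observed, p = permutation_p_value(x, y)
print(f"observed fit = {observed:.3f}, permutation p-value = {p:.3f}")
```

A model whose fit is no better than most of the permuted fits is indistinguishable from a random model, regardless of how well it appears to separate the groups.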

Page 22: High Dimensional Biological Data Analysis and Visualization

Databases for organism specific biochemical information:

Multiple organisms

• KEGG

• BioCyc

• Reactome

Human

• HMDB

• SMPDB

Biochemical domain information

Page 23: High Dimensional Biological Data Analysis and Visualization

Pathway Enrichment Analysis

http://www.metaboanalyst.ca/MetaboAnalyst/faces/UploadView.jsp

[Figure axes: enrichment, topological importance]
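One common way to quantify enrichment is an over-representation test based on the hypergeometric distribution. The sketch below assumes that approach; it is not a description of how any particular tool computes its scores, and the function name `enrichment_p_value` and the counts are hypothetical.

```python
from math import comb


def enrichment_p_value(n_background, n_pathway, n_altered, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric distribution.

    n_background: all measured metabolites
    n_pathway:    measured metabolites that belong to the pathway
    n_altered:    metabolites flagged as significantly changed
    n_overlap:    altered metabolites that belong to the pathway
    """
    upper = min(n_pathway, n_altered)
    total = comb(n_background, n_altered)  # ways to pick the altered set
    tail = sum(comb(n_pathway, k) * comb(n_background - n_pathway, n_altered - k)
               for k in range(n_overlap, upper + 1))
    return tail / total


# Hypothetical counts: 200 measured, 12 in the pathway, 20 altered, 5 in both.
print(f"enrichment p-value = {enrichment_p_value(200, 12, 20, 5):.5f}")
```

When many pathways are tested, the resulting p-values need the same multiple-testing adjustment as any other set of hypothesis tests.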

Page 24: High Dimensional Biological Data Analysis and Visualization

Network Mapping

• Biochemical
• Structural Similarity

doi:10.1186/1471-2105-13-99

Page 25: High Dimensional Biological Data Analysis and Visualization

Data visualization as form of analysis

DM

Liver CYP2D6

Dextromethorphan = additives in

dextrorphan

• high fructose corn syrup

• antioxidants

• flavor

Page 26: High Dimensional Biological Data Analysis and Visualization

Identification of relationships between altered metabolites

[Figure labels: urea cycle, nucleotide synthesis, protein glycosylation]

Page 27: High Dimensional Biological Data Analysis and Visualization

Identification of treatment effects

Page 28: High Dimensional Biological Data Analysis and Visualization

Analysis of differential metabolic responses

Treatment 1 Treatment 2

Page 29: High Dimensional Biological Data Analysis and Visualization

Resources

• DeviumWeb: Dynamic multivariate data analysis and visualization platform
  url: https://github.com/dgrapov/DeviumWeb
• imDEV: Microsoft Excel add-in for multivariate analysis
  url: http://sourceforge.net/projects/imdev/
• MetaMapR: Network analysis tools for metabolomics
  url: https://github.com/dgrapov/MetaMapR
• TeachingDemos: Tutorials and demonstrations
  url: http://sourceforge.net/projects/teachingdemos/?source=directory
  url: https://github.com/dgrapov/TeachingDemos
• CDS Blog: Data analysis case studies
  url: http://imdevsoftware.wordpress.com/

Page 30: High Dimensional Biological Data Analysis and Visualization

metabolomics.ucdavis.edu

This research was supported in part by NIH 1 U24 DK097154