High Dimensional Biological Data Analysis and Visualization
DESCRIPTION
Examples of data analysis and visualization of high dimensional metabolomic data.
TRANSCRIPT
Dmitry Grapov, PhD
Metabolomic Data Analysis for the Study of Diseases
State of the art facility producing massive amounts of biological data…
>13,000 samples/yr
>160 studies
~32,000 data points/study
Goals?
Analysis at the Metabolomic Scale
Univariate vs. Multivariate
Univariate: Group 1 vs. Group 2
• Hypothesis testing (t-test, ANOVA, etc.)
Multivariate: Predictive Modeling
• PCA, (O-)PLS(-DA)
univariate/bivariate vs. multivariate
• mixed up samples?
• outliers?
Univariate vs. Multivariate
Data Complexity
[Figure: data matrix (n samples × m variables); complexity: 1-D, 2-D, m-D]
Meta Data
Experimental Design =
Variable # = dimensionality
Statistical Analysis
• Identify differences in sample population means
• sensitive to distribution shape
• parametric = assumes normality
• error in Y, not in X (Y = mX + error)
• optimal for long data
• assumed independence
• false discovery rate (FDR)
[Figure: data shapes: long, wide, n-of-one]
Achieving “significance” is a function of:
significance level (α) and power (1-β)
effect size (standardized difference in means)
sample size (n)
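How these three quantities trade off can be sketched with a normal approximation for a two-sided, two-group comparison. This is a rough illustration, not the method used in the slides: the function names are hypothetical, and a t-distribution based calculation would give slightly larger sample sizes.

```python
from math import ceil, sqrt
from statistics import NormalDist


def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) of a two-sided, two-sample comparison.

    Assumes equal group sizes and a normal approximation to the test statistic.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected shift of the test statistic under the alternative hypothesis.
    shift = effect_size * sqrt(n_per_group / 2)
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def n_per_group_for_power(effect_size, power=0.80, alpha=0.05):
    """Smallest n per group reaching the requested power (same approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


if __name__ == "__main__":
    for d in (0.2, 0.5, 0.8):
        print(d, n_per_group_for_power(d))
```

Smaller effect sizes or stricter significance levels push the required sample size up quickly, which is why wide data with few samples is hard to test variable by variable.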
• Type I Error: False Positives
• Type II Error: False Negatives
• Type I risk = 1 - (1 - p.value)^m
m = number of variables tested
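The Type I risk formula can be evaluated directly to see how fast it grows with the number of variables tested. A minimal sketch, assuming the m tests are independent and that p.value is the per-test significance threshold; the function name is hypothetical.

```python
def type_one_risk(p_value, m):
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - p_value) ** m


if __name__ == "__main__":
    for m in (1, 10, 148):
        print(m, type_one_risk(0.05, m))
```

With a 0.05 threshold, ten tests already carry roughly a 40% chance of at least one false positive.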
FDR correction
• p-value adjustment or estimate of FDR (Fdr, q-value)
False Discovery Rate (FDR)
Bioinformatics (2008) 24 (12):1461-1462
FDR correction
[Figure: FDR-adjusted p-value vs. p-value]
Benjamini & Hochberg (1995) (“BH”)
• Accepted standard
Bonferroni
• Very conservative
• adjusted p-value = p-value × # of tests (e.g. 0.005 × 148 = 0.74)
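Both adjustments are short enough to write out. The sketch below uses hypothetical function names and the usual step-up form of the BH adjustment, where each sorted p-value is scaled by m/rank and forced to be monotone; it is meant as an illustration, not a replacement for a tested statistics package.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """BH-adjusted p-values, returned in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]
    print(bonferroni(raw))
    print(benjamini_hochberg(raw))
```

On the same input, the BH values are never larger than the Bonferroni values, which is the sense in which Bonferroni is the more conservative correction.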
Multivariate Analysis
Clustering
• Grouping based on similarity/dissimilarity
Principal Components Analysis (PCA)
• Identify modes of variance in the data
Partial Least Squares (PLS)
• Identify modes of variance in the data correlated with a hypothesis
Cluster Analysis
Use similarity/dissimilarity to group a collection of samples or variables
Approaches
• hierarchical (HCA)
• non-hierarchical (k-NN, k-means)
• distribution (mixture models)
• density (DBSCAN)
• self organizing maps (SOM)
[Figure panels: Linkage, k-means, Distribution, Density]
Hierarchical Cluster Analysis
similarity/dissimilarity defines “nearness” or distance
objects are grouped based on linkage methods
Hierarchy of Similarity
[Figure: hierarchy of similarity; axis: Similarity]
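The grouping step can be sketched as a small agglomerative loop. This example assumes Euclidean distance and single linkage (the distance between two clusters is the distance between their closest members); both are illustrative choices, and the function name is hypothetical.

```python
from math import dist


def single_linkage(points, n_clusters):
    """Merge the two nearest clusters until n_clusters remain.

    Returns clusters as sorted lists of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: closest pair of members defines the distance.
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


if __name__ == "__main__":
    samples = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
    print(single_linkage(samples, 3))
```

Swapping `min` for `max` in the inner distance gives complete linkage, which tends to produce more compact groups; the choice of linkage changes the resulting hierarchy.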
How does my metadata match my data structure?
Hierarchy of effect sizes
Projection of Data
The algorithm defines the position of the light source
Principal Components Analysis (PCA)
• unsupervised
• maximize variance (X)
Partial Least Squares Projection to Latent Structures (PLS)
• supervised
• maximize covariance (Y ~ X)
James X. Li, 2009, VisuMap Tech.
PC1, PC2
http://www.scholarpedia.org/article/Eigenfaces
Raw data PCA dimensions
Interpreting PCA Results
Variance explained (eigenvalues)
Row (sample) scores and column (variable) loadings
How are scores and loadings related?
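One way to see how the three outputs fit together: the loadings are a unit-length direction in variable space, the scores are the centered data projected onto that direction, and the eigenvalue is the variance of those scores. The sketch below finds only the first component by power iteration on the covariance matrix; that is an illustrative shortcut (real PCA software typically uses SVD), and the function name is hypothetical.

```python
from math import sqrt


def first_principal_component(rows, n_iter=200):
    """Return (loadings, scores, fraction of variance explained) for PC1."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration: repeatedly multiply by the covariance matrix and
    # renormalize; the vector converges to the direction of maximum variance.
    v = [1.0 / sqrt(p)] * p
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("no variance along the starting direction")
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    total_variance = sum(cov[i][i] for i in range(p))
    # Scores = centered data projected onto the loadings.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores, eigenvalue / total_variance


if __name__ == "__main__":
    data = [(1, 2), (2, 4), (3, 6), (4, 8)]
    loadings, scores, explained = first_principal_component(data)
    print(loadings, scores, explained)
```

In this toy data the second variable is exactly twice the first, so a single component captures all of the variance.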
Centering and Scaling
PMID: 16762068
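Centering and scaling matter because variance-based methods are otherwise dominated by the variables with the largest numeric range. A minimal sketch of one common choice, autoscaling (subtract the column mean, divide by the column standard deviation); the function name is hypothetical, and other scalings may suit a given dataset better.

```python
from statistics import mean, stdev


def autoscale(rows):
    """Center each column to mean 0 and scale it to unit standard deviation.

    Raises ZeroDivisionError for a constant column (standard deviation 0).
    """
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    sds = [stdev(c) for c in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


if __name__ == "__main__":
    # Two variables measured on very different scales.
    print(autoscale([[1, 100], [2, 200], [3, 300]]))
```

After autoscaling, both variables contribute equally to the total variance regardless of their original units.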
Use PLS to test a hypothesis
time = 0 120 min.
Partial Least Squares (PLS) is used to identify planes of maximum correlation between X measurements and Y (hypothesis)
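For a single response Y, the first PLS direction has a simple closed form: the weight vector is proportional to X'y computed on centered data, which is the unit-length direction whose scores have maximum covariance with Y. The sketch below computes only this first component; it is an illustration with a hypothetical function name, not a full PLS or O-PLS implementation.

```python
from math import sqrt


def pls1_first_component(X, y):
    """Return (weights, scores) of the first PLS component for one response."""
    n, p = len(X), len(X[0])
    x_means = [sum(r[j] for r in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[r[j] - x_means[j] for j in range(p)] for r in X]
    yc = [v - y_mean for v in y]
    # Weight of each variable is proportional to its covariance with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sqrt(sum(v * v for v in w))
    if norm == 0:
        raise ValueError("X has no covariance with y")
    w = [v / norm for v in w]
    # Scores: centered samples projected onto the weight vector.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    return w, t


if __name__ == "__main__":
    X = [[1, 0], [2, 1], [3, 1], [4, 0]]
    y = [1, 2, 3, 4]
    print(pls1_first_component(X, y))
```

In this toy data the second variable has zero covariance with y, so it receives zero weight, whereas PCA would still give it weight in proportion to its variance.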
PCA PLS
PLS model validation is critical
Determine in-sample (Q2) and out-of-sample error (RMSEP) and compare to a random model
• permutation tests
• training/testing
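The permutation idea can be sketched independently of the model: compute a cross-validated error for the real labels, then repeat with Y shuffled many times and ask how often a random model does as well. In the sketch below a one-variable least-squares line with leave-one-out error stands in for the PLS model and its RMSEP; that substitution, the function names, and the fixed random seed are all assumptions made to keep the example self-contained.

```python
import random
from math import sqrt


def loo_rmse(x, y):
    """Leave-one-out RMSE of a one-variable least-squares line."""
    squared_errors = []
    for k in range(len(x)):
        xs, ys = x[:k] + x[k + 1:], y[:k] + y[k + 1:]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        sxx = sum((a - mx) ** 2 for a in xs)
        slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sxx
        predicted = my + slope * (x[k] - mx)
        squared_errors.append((y[k] - predicted) ** 2)
    return sqrt(sum(squared_errors) / len(squared_errors))


def permutation_p_value(x, y, n_perm=199, seed=0):
    """Fraction of label permutations whose error is as low as the observed."""
    rng = random.Random(seed)
    observed = loo_rmse(x, y)
    as_good = 0
    for _ in range(n_perm):
        shuffled = y[:]
        rng.shuffle(shuffled)
        if loo_rmse(x, shuffled) <= observed:
            as_good += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return observed, (as_good + 1) / (n_perm + 1)


if __name__ == "__main__":
    x = [float(i) for i in range(10)]
    y = [2 * a + 1 + 0.1 * (-1) ** i for i, a in enumerate(x)]
    print(permutation_p_value(x, y))
```

A model whose error is no better than most permuted models has learned nothing beyond chance structure, however good its in-sample fit looks.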
Databases for organism specific biochemical information:
Multiple organisms
• KEGG
• BioCyc
• Reactome
Human
• HMDB
• SMPDB
Biochemical domain information
Pathway Enrichment Analysis
http://www.metaboanalyst.ca/MetaboAnalyst/faces/UploadView.jsp
enrichment, topological importance
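The enrichment part is often scored as an over-representation test. The sketch below assumes a hypergeometric model, which is one common choice and not necessarily the one used by the tool linked above; the function name and the numbers in the usage example are hypothetical.

```python
from math import comb


def enrichment_p_value(n_total, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) under a hypergeometric model.

    n_total:    metabolites measured overall
    n_pathway:  measured metabolites that belong to the pathway
    n_selected: metabolites flagged as altered
    n_overlap:  altered metabolites that belong to the pathway
    """
    upper = min(n_pathway, n_selected)
    tail = sum(comb(n_pathway, k) * comb(n_total - n_pathway, n_selected - k)
               for k in range(n_overlap, upper + 1))
    return tail / comb(n_total, n_selected)


if __name__ == "__main__":
    # Hypothetical numbers: 150 measured, 12 in the pathway, 20 altered, 6 shared.
    print(enrichment_p_value(150, 12, 20, 6))
```

Because one p-value is produced per pathway, these values need the same multiple-testing correction as any other set of hypothesis tests.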
Biochemical Network Mapping
doi:10.1186/1471-2105-13-99
Structural Similarity
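A structural similarity edge between two metabolites is commonly scored with the Tanimoto coefficient on substructure fingerprints. Treat that as an assumption about how such networks are built rather than a description of the slide; the fingerprint bit sets and the function name below are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared fingerprint bits / all bits set in either."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Hypothetical fingerprints given as the positions of the bits that are set.
    metabolite_1 = {1, 2, 3}
    metabolite_2 = {2, 3, 4}
    print(tanimoto(metabolite_1, metabolite_2))
```

Connecting only those pairs whose score exceeds a chosen threshold turns the similarity matrix into a network that can be drawn alongside biochemical reaction edges.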
Data visualization as form of analysis
DM
Liver CYP2D6
Dextromethorphan = additives in
dextrorphan
• high fructose corn syrup
• antioxidants
• flavor
Identification of relationships between altered metabolites
[Figure labels: urea cycle, nucleotide synthesis, protein glycosylation]
Identification of treatment effects
Analysis of differential metabolic responses
Treatment 1 Treatment 2
Resources
• DeviumWeb: Dynamic multivariate data analysis and visualization platform
  url: https://github.com/dgrapov/DeviumWeb
• imDEV: Microsoft Excel add-in for multivariate analysis
  url: http://sourceforge.net/projects/imdev/
• MetaMapR: Network analysis tools for metabolomics
  url: https://github.com/dgrapov/MetaMapR
• TeachingDemos: Tutorials and demonstrations
  url: http://sourceforge.net/projects/teachingdemos/?source=directory
  url: https://github.com/dgrapov/TeachingDemos
• CDS Blog: Data analysis case studies
  url: http://imdevsoftware.wordpress.com/
[email protected] metabolomics.ucdavis.edu
This research was supported in part by NIH 1 U24 DK097154