Exploring Data Multidimensionally


Upload: dmitry-grapov

Post on 10-May-2015

Category: Education



DESCRIPTION

Exploring Omics Data Multidimensionally

Dmitry Grapov1,2, John W. Newman1,2

1Department of Nutrition, University of California, Davis, CA 95616. 2Obesity and Metabolism Research Unit, USDA, ARS, Western Human Nutrition Research Center, Davis, CA 95616.

Omics experiments generate complex, high-dimensional data that require multivariate analysis techniques to evaluate properly. While a variety of commercial packages exist to accomplish this task, they do not adequately support the interpretation of their results within a biological context. Interactive modules for data exploration and visualization (imDEV) is a Microsoft Excel spreadsheet embedded application providing an integrated environment for the analysis of omics data sets with a user-friendly interface. Individual modules were designed to enable interactive and dynamic analyses of large data sets by interfacing R’s multivariate statistics and highly customizable visualizations with the spreadsheet environment of Microsoft Excel, to aid robust inference and generate information-rich visualizations of multivariate data. These tools enable biologists to link complex multivariate methods such as principal (PCA) and independent component analyses (ICA), partial least squares regression (PLS) and discriminant analysis (PLS-DA) to high-quality, dynamic, two- and three-dimensional visualizations including scatter plot matrices, distribution plots, dendrograms, heat maps, biplots, trellis biplots and correlation networks, which enable visual data mining and improve the integration of analytical results within a biological context. The user-friendly interface to multivariate statistical and visualization methods provides biologists with the basic tools needed to carry out holistic, systems-level analyses of high-dimensional experimental results.
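As a rough illustration of the correlation networks named in the description, the following Python sketch turns pairwise Pearson correlations into a network edge list. It is not imDEV code (imDEV relies on R inside Excel); the function names, the toy variable names and values, and the 0.8 cutoff are all assumptions made for this example.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_edges(variables, cutoff=0.8):
    """Return (name_a, name_b, r) for every variable pair with |r| >= cutoff."""
    edges = []
    for a, b in combinations(sorted(variables), 2):
        r = pearson(variables[a], variables[b])
        if abs(r) >= cutoff:  # keep strong positive and negative links
            edges.append((a, b, round(r, 3)))
    return edges


# Hypothetical measurements for three variables across five samples.
toy = {
    "met_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "met_B": [2.1, 3.9, 6.2, 8.1, 9.8],  # tracks met_A closely
    "met_C": [5.0, 1.0, 4.0, 2.0, 3.0],  # only weakly related to the others
}
edges = correlation_edges(toy)
print(edges)
```

The resulting edge list could be passed to any graph layout tool; the cutoff decides how dense the network becomes and would normally be chosen with the data set and multiple-testing concerns in mind.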

TRANSCRIPT

Page 1: Exploring Data Multidimensionally

1-D

Dmitry Grapov1,2, John W. Newman2,3

1Agricultural and Environmental Chemistry, University of California Davis; 2Obesity & Metabolism Research Unit, USDA-ARS Western Human Nutrition Research Center, Davis, CA; 3Nutrition, University of California Davis, Davis, CA

Exploring Data Multidimensionally

This work was supported by the National Institutes of Health [T32-GM008799, R01DK078328-01] and the United States Department of Agriculture [5306-51530-019-00D].

imDEV: interactive modules for Data Visualization and Exploration (http://sourceforge.net/projects/imdev/)

Type 2 Diabetes Trends

Hypotheses: Environment + Genotype = Disease Phenotype

Unknowns: Markers of Disease Processes

[Figure: Type 2 Diabetes Trends, panels for 1994, 2000 and 2009. Legend: No Data, <4.5%, 4.5-5.9%, 6.0-7.4%, 7.5-8.9%, >9.0%]

Gathering Information: Quantitative Metabolomics

Interpretation: Multivariate Data Analysis and Visualization

Mathematical Dimensions: 2-D, n-D

Connection

Screenshot of the MS Excel embedded imDEV interface utilizing R to generate: A) PCA scores and loadings trellis plot displaying the first three components, with subject scores (bottom left) colored by gender and sized relative to individuals' CRF values, with an outlier highlighted in red, and loadings (top right) sized to display variable p-values from a Mann-Whitney U-test for gender. B) Variable distribution and scatter plot matrix used to evaluate the effects of the covariate adjustment. C) Overview plots used to evaluate the gender-adjusted PLS predictive model for CRF performance by comparing the model's Q2 and RMSEP statistics to their respective permuted null distributions. D) Multidimensionally scaled two- and three-dimensional variable correlation networks used to visualize intercorrelations among variables and variable groups (polygons).
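Panel C compares a model's RMSEP to a permuted null distribution. The Python sketch below shows that general idea under simplifying assumptions: it swaps in a one-predictor least-squares fit for PLS, estimates RMSEP by leave-one-out cross-validation, and uses made-up data. The function names, the 500 permutations, and the seed are illustrative choices, not imDEV's implementation.

```python
import math
import random


def loo_rmsep(x, y):
    """Leave-one-out RMSEP for a one-predictor least-squares fit."""
    n = len(x)
    squared_errors = []
    for i in range(n):
        # Fit on every sample except i, then predict sample i.
        xs = x[:i] + x[i + 1:]
        ys = y[:i] + y[i + 1:]
        mean_x = sum(xs) / len(xs)
        mean_y = sum(ys) / len(ys)
        sxx = sum((a - mean_x) ** 2 for a in xs)
        sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        squared_errors.append((y[i] - (intercept + slope * x[i])) ** 2)
    return math.sqrt(sum(squared_errors) / n)


def permutation_test(x, y, n_perm=500, seed=0):
    """Compare the observed RMSEP to RMSEPs obtained with shuffled responses."""
    rng = random.Random(seed)
    observed = loo_rmsep(x, y)
    null = []
    for _ in range(n_perm):
        shuffled = y[:]
        rng.shuffle(shuffled)  # break any real x-y relationship
        null.append(loo_rmsep(x, shuffled))
    # Fraction of permuted models that predict at least as well as the real one.
    as_good = sum(1 for value in null if value <= observed)
    p_value = (as_good + 1) / (n_perm + 1)
    return observed, null, p_value


# Hypothetical data: the response rises roughly linearly with the predictor.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.1, 3.9, 6.2, 8.0, 9.7, 12.3, 14.1, 15.8, 18.2, 20.1]
observed, null, p_value = permutation_test(x, y)
print(f"observed RMSEP = {observed:.3f}, permutation p = {p_value:.4f}")
```

A small p-value here means that few shuffled-response models predicted as well as the real one, which is the comparison the overview plots in panel C present graphically.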

Prediction

Projection

Correlation

Visual

Analytical Dimensions

Complex pathological states often involve systems of perturbations. In order to fully characterize, diagnose and treat these maladies, researchers are turning to ‘Omic’ methods to measure large arrays of genetic and biochemical reporters. Advances in analytical technologies are increasingly shifting the bottleneck for scientific discovery to data management, analysis and interpretation.

The challenge of ‘Big Data’ may be met through applications of multivariate data analysis and visualization, which make it possible to analyze many variables simultaneously. Interactive modules for Data Exploration and Visualization (imDEV; http://sourceforge.net/apps/mediawiki/imdev) is an open source graphical user interface for multidimensional data analysis and visualization.

Biological