Enrico Glaab, Luxembourg Centre for Systems Biomedicine
Exploiting technical replicate variance in omics data analysis (RepExplore)
2
Introduction: Variance and replicates
• Biomedical omics datasets may contain different types of variance:
Variance across samples from different individuals:
- Between-group variance
- Within-group variance
Variance across samples from the same individual:
- Intra-individual variance
- Technical variance
• Between- and within-group variance are covered via biological replicates, intra-individual variance via time-series measurements, and technical variance via technical replicates
3
Motivation: Loss of technical variance information
• Typically, combinations of biological and summarized technical replicates are used for case-control analyses:
• Summarization of technical replicates into average measurements can provide more robust input data, but information on the variance across technical replicates is lost!
Figure: technical replicates are combined by mean/median summarization before statistical analysis
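To make the loss concrete, here is a minimal sketch with hypothetical values (not taken from any dataset in this talk). The two samples have the same replicate mean but very different replicate spread, so after mean summarization they look identical to any downstream test.

```python
from statistics import mean, variance

# Hypothetical log-scale technical replicate measurements for two samples.
replicates = {
    "sample_A": [10.0, 10.1, 9.9],  # tight replicates
    "sample_B": [8.0, 10.0, 12.0],  # noisy replicates
}

def summarize(values):
    """Return the mean and the sample variance of one sample's technical replicates."""
    return mean(values), variance(values)

summary = {name: summarize(values) for name, values in replicates.items()}

for name, (m, v) in summary.items():
    # The means agree; only the variance column tells the samples apart.
    print(f"{name}: mean={m:.2f}, technical variance={v:.2f}")
```

Keeping the second value of each pair is all that is needed to carry the technical variance forward into later analysis steps.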
4
Example: Relevance of technical variance information
Figure: Abundance of the top differential metabolite (L-valine) in the Arabidopsis data by Andersen et al. (2014), (a) without technical error (mean-summarized technical replicates) and (b) with technical error (see whiskers)
5
Probability of positive likelihood ratio (PPLR)
• We assume that omics data on a logarithmic scale are approximately normally distributed
• Using mean (μ) and variance (s²) estimates derived from technical replicate measurements, differential expression can be scored using the probability of positive likelihood ratio (PPLR):
where P is the cumulative distribution function for the standard normal curve.
• For probe replicates on DNA microarray chips, Liu et al. propose a dedicated parameter estimation approach to incorporate the probe-level measurement error into variance estimates s² (Bioinformatics, 2006)
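The PPLR formula itself did not survive in this transcript. The sketch below assumes a commonly used form, PPLR = P((μ1 − μ2) / sqrt(s1² + s2²)), where μ1, μ2 and s1², s2² are the mean and variance estimates for the two conditions. The function names are hypothetical, and how the estimates are pooled across samples (for example with the approach of Liu et al.) is deliberately left open.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def pplr(mu_1, var_1, mu_2, var_2):
    """Assumed PPLR form: P((mu_1 - mu_2) / sqrt(var_1 + var_2))."""
    return normal_cdf((mu_1 - mu_2) / sqrt(var_1 + var_2))

# Hypothetical log-scale estimates: same mean difference, different technical variance.
tight = pplr(11.0, 0.5, 10.0, 0.5)
noisy = pplr(11.0, 2.0, 10.0, 2.0)
print(f"tight replicates: {tight:.3f}, noisy replicates: {noisy:.3f}")
```

Under this assumed form, values near 1 or 0 point to higher or lower abundance in the first condition, and 0.5 means no evidence of a difference. A larger technical variance pulls the score towards 0.5 even when the mean difference is unchanged.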
6
Comparison: Classical vs. PPLR top-ranked result
Figure: Whisker plots of the top differential metabolite in the Arabidopsis data by Andersen et al. (2014): (a) classical approach (L-valine), (b) PPLR (L-proline)
7
Application to Parkinson’s disease transcriptome data
Figure: (a) Whisker plot of the top differential gene (NDUFB2) in the Parkinson’s disease dataset by Simunovic et al. (2009); (b) Heat map of the top 10 differentially expressed genes according to the PPLR statistic.
8
Improved robustness of rankings across studies
• Evaluation: Compare the similarity of gene rankings across two Parkinson’s disease datasets (Simunovic et al., 2009, and Zheng et al., 2010)
• Two gene ranking methods are compared: PPLR and limma/eBayes (note: the duplicateCorrelation function is not generally applicable)
• Kendall’s tau is used as the similarity measure for rankings
Result: significantly higher similarity of rankings with the PPLR statistic (p < 2.2E-16 for both up- and down-regulated genes)
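As an illustration of the similarity measure, the following sketch computes Kendall's tau (the tau-a variant, assuming no tied scores) for two score vectors over the same genes. The gene labels and scores are hypothetical. The talk does not specify how the rankings were split into up- and down-regulated genes or how the p-value was obtained, so neither is shown here.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a for two equally long score lists without ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        # A pair is concordant if both studies order the two genes the same way.
        product = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical per-gene scores from two studies, listed in the same gene order.
genes = ["geneA", "geneB", "geneC", "geneD"]
study_1 = [0.99, 0.80, 0.55, 0.10]
study_2 = [0.95, 0.60, 0.70, 0.05]

tau = kendall_tau(study_1, study_2)
print(f"Kendall's tau across {len(genes)} genes: {tau:.3f}")
```

For real rankings with thousands of genes and possible ties, a library routine such as `scipy.stats.kendalltau` is the more practical choice.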
9
Rankings improve with increasing numbers of replicates
• Apply the method to simulated normal data with standard deviation 1, 100 samples (two groups, 50 per group) and 1000 features/biomolecules (900 uncorrelated, irrelevant features and 100 differential features with a fixed effect size of D = 1)
• Add simulated technical replicates with measurement noise (R function jitter)
• Test enrichment of the 100 differential features in the final PPLR ranking
Result: the enrichment score increases with increasing numbers of replicates
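The simulation can be sketched roughly as follows. This is a simplified stand-in, not the original evaluation. Several things are assumptions: uniform noise with a hypothetical `jitter_amount` mimics the general behaviour of R's `jitter`, the squared standard error of the replicate-averaged sample means serves as the variance estimate, and the number of true differential features among the top 100 is a simple proxy for the enrichment score, which the talk does not define.

```python
import random
from math import sqrt

def group_mean_and_se2(sample_means):
    """Mean of the per-sample means and the squared standard error of that mean."""
    n = len(sample_means)
    m = sum(sample_means) / n
    var = sum((x - m) ** 2 for x in sample_means) / (n - 1)
    return m, var / n

def count_hits(n_replicates, n_features=1000, n_differential=100,
               n_per_group=50, effect=1.0, jitter_amount=3.0, seed=1):
    """Number of truly differential features among the top-ranked n_differential."""
    rng = random.Random(seed)
    scored = []
    for feature in range(n_features):
        is_differential = feature < n_differential
        stats = []
        for group_shift in (0.0, effect if is_differential else 0.0):
            sample_means = []
            for _ in range(n_per_group):
                true_value = rng.gauss(group_shift, 1.0)
                # Technical replicates: true value plus uniform measurement noise.
                reps = [true_value + rng.uniform(-jitter_amount, jitter_amount)
                        for _ in range(n_replicates)]
                sample_means.append(sum(reps) / n_replicates)
            stats.append(group_mean_and_se2(sample_means))
        (m0, v0), (m1, v1) = stats
        z = (m1 - m0) / sqrt(v0 + v1)
        # A PPLR-style score is monotone in z, so ranking by |z| gives the
        # same order as ranking by the score's distance from 0.5.
        scored.append((abs(z), is_differential))
    scored.sort(key=lambda item: item[0], reverse=True)
    return sum(flag for _, flag in scored[:n_differential])

hits = {r: count_hits(r) for r in (1, 8)}
print(hits)
```

With more replicates per sample, the averaged measurements carry less technical noise, so more of the truly differential features reach the top of the ranking.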
10
Using replicate variance information in PCA
• Principal Component Analysis (PCA) can be cast as a probabilistic model (Tipping & Bishop, 1999) where d-dimensional data points y_n can be reconstructed from a q-dimensional latent point x_n via a linear transformation (W) and a noise vector ε_n:
• The corresponding data distribution is:
• If each dimension is allowed to have a different noise variance:
• While the maximum likelihood solution for the original model is equivalent to PCA, Sanguinetti et al. presented an expectation maximization (EM) method for the model with feature-specific noise variance (Bioinformatics, 2005)
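The formulas on this slide did not survive in the transcript. For orientation, the standard formulation from Tipping & Bishop (1999) that matches the description above is sketched below. The mean vector μ, the isotropic variance σ², and in particular the diagonal form used for the feature-specific case are assumptions about what the slide showed, not a reproduction of it.

```latex
% Latent variable model
y_n = W x_n + \mu + \varepsilon_n, \qquad
x_n \sim \mathcal{N}(0, I_q), \qquad
\varepsilon_n \sim \mathcal{N}(0, \sigma^2 I_d)

% Marginal data distribution
y_n \sim \mathcal{N}\left(\mu,\; W W^{\top} + \sigma^2 I_d\right)

% Feature-specific noise variances (assumed form)
y_n \sim \mathcal{N}\left(\mu,\; W W^{\top} + \operatorname{diag}(\sigma_1^2, \dots, \sigma_d^2)\right)
```

In the isotropic case a single σ² is shared by all d dimensions, which is what makes the maximum likelihood solution coincide with classical PCA. Once the variances differ per feature, no closed-form solution of that kind is available, which is why an iterative EM procedure is used.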
11
Denoising property & results on Parkinson’s disease data
• In the new probabilistic PCA, the noise estimate enables an evaluation of the significance of principal components (PCs), which allows automated computation of the maximum number of retainable PCs
• By accounting for measurement error in the model, noisy values are down-weighted, so pre-processing data via the modified PCA tends to provide tighter sample clusters (confirmed on Parkinson’s disease transcriptome data)
12
Summary
• Technical replicate variance in omics data is usually not constant across different biomolecules, so summarizing replicates only via averaging results in a loss of valuable information
• On Parkinson’s disease transcriptomics data, accounting for technical replicate variance provided improved robustness of differential expression rankings across studies and gave tighter clusters in PCA visualizations
• All methods presented have been implemented on a public web-server (RepExplore: www.repexplore.tk, Bioinformatics, 2015), providing automated analyses with interactive visualizations and ranking tables
13
References
1. E. Glaab, R. Schneider, RepExplore: Addressing technical replicate variance in proteomics and metabolomics data analysis, Bioinformatics (2015), 31(13), pp. 2235
2. E. Glaab, Using prior knowledge from cellular pathways and molecular networks for diagnostic specimen classification, Briefings in Bioinformatics (2015), in press (doi: 10.1093/bib/bbv044)
3. E. Glaab, R. Schneider, Comparative pathway and network analysis of brain transcriptome changes during adult aging and in Parkinson's disease, Neurobiology of Disease (2015), 74, 1-13
4. N. Vlassis, E. Glaab, GenePEN: analysis of network activity alterations in complex diseases via the pairwise elastic net, Statistical Applications in Genetics and Molecular Biology (2015), 14(2), pp. 221
5. E. Glaab, Building a virtual ligand screening pipeline using free software: a survey, Briefings in Bioinformatics (2015), in press (doi: 10.1093/bib/bbv037)
6. E. Glaab, A. Baudot, N. Krasnogor, R. Schneider, A. Valencia, EnrichNet: network-based gene set enrichment analysis, Bioinformatics, 28(18):i451-i457, 2012
7. E. Glaab, R. Schneider, PathVar: analysis of gene and protein expression variance in cellular pathways using microarray data, Bioinformatics, 28(3):446-447, 2012
8. E. Glaab, J. Bacardit, J. M. Garibaldi, N. Krasnogor, Using rule-based machine learning for candidate disease gene prioritization and sample classification of cancer gene expression data, PLoS ONE, 7(7):e39932, 2012
9. E. Glaab, A. Baudot, N. Krasnogor, A. Valencia, TopoGSA: network topological gene set analysis, Bioinformatics, 26(9):1271-1272, 2010
10. E. Glaab, A. Baudot, N. Krasnogor, A. Valencia, Extending pathways and processes using molecular interaction networks to analyse cancer genome data, BMC Bioinformatics, 11(1):597, 2010
11. H. O. Habashy, D. G. Powe, E. Glaab, N. Krasnogor, J. M. Garibaldi, E. A. Rakha, G. Ball, A. R. Green, C. Caldas, I. O. Ellis, RERG (Ras-related and oestrogen-regulated growth-inhibitor) expression in breast cancer: A marker of ER-positive luminal-like subtype, Breast Cancer Research and Treatment, 128(2):315-326, 2011
12. E. Glaab, J. M. Garibaldi, N. Krasnogor, ArrayMining: a modular web-application for microarray analysis combining ensemble and consensus methods with cross-study normalization, BMC Bioinformatics, 10:358, 2009
13. E. Glaab, J. M. Garibaldi, N. Krasnogor, Learning pathway-based decision rules to classify microarray cancer samples, German Conference on Bioinformatics 2010, Lecture Notes in Informatics (LNI), 173, 123-134
14. E. Glaab, J. M. Garibaldi, N. Krasnogor, VRMLGen: An R-package for 3D Data Visualization on the Web, Journal of Statistical Software, 36(8), 1-18, 2010