Post on 04-Jun-2020
New Challenges in Bioinformatics: Integrative Analysis of Omics Data
Alex Sánchez
1 Statistics and Bioinformatics Research Group, Statistics Department, Universitat de Barcelona
2 Statistics and Bioinformatics Unit, Vall d’Hebron Institut de Recerca
Outline
• Introduction: omics, data integration, integrative analysis
• Integrative analysis: challenges and methods
• Some (prototypical) examples
  – Multivariate statistical approach to integrative analysis
  – Building better predictors from diverse data sources
  – Gene sets and their application to integrative analysis
  – Network methods for visualization and data integration
• Where to now?
Who, where, what?
Omics data
[Figure: example data types. 1H NMR metabolites (ppm axis), Affy transcriptome, LC-MS proteomics, NGS sequences, and “non-omic” markers such as adiponectin (ug/ml; change from baseline at day 7 and day 14; db/+ vs db/db; normal vs disease).]
Experimental platforms generate diverse omics and non-omics data
Genomics
• Uses sequencing technologies to study genomes and intragenomic phenomena.
• Data: DNA sequences
Transcriptomics
• The transcriptome is the set of all RNA molecules in one cell or a population of cells.
• Transcriptomics examines expression levels of mRNAs in a given cell population.
• Technologies
  • Microarrays
  • Next Generation Sequencing
Proteomics
• The large-scale study of proteins (the proteome)
  • (3D) structures and
  • functions.
• Spectrum of techniques
  • 2D gel based
  • Mass Spectrometry (MS)
  • SELDI-TOF (MS)
  • Protein arrays
  • …
Metabolomics
• Comprehensive and simultaneous systematic determination of
  • metabolite levels in the metabolome and
  • their changes over time as a consequence of stimuli.
• Relies on
  • Separation techniques
    • GC, CE, HPLC, UPLC
  • Detection techniques
    • NMR, MS
Altogether: The central dogma and the omics cascade
Why would we want to integrate data?
Why should we integrate data?
What we learn from an experiment may depend on where we look, how we look, and the scope of our view!
The Blind Men and the Elephant
http://www.noogenesis.com/pineapple/blind_men_elephant.html
Focusing on one platform risks missing an obvious signal!!!
From componentwise to global approaches
It is expected that the integrated collection and analysis of diverse types of data,
jointly modelled and analyzed in a systems biology approach
can shed light on the global functioning of biological systems.
Ultimate Goal: understanding of complex processes
Integrative Analysis & Data integration: methods, types, challenges
Data Integration is cool
• Everywhere nowadays in Biology, Medicine, Bioinformatics, …
• Meetings
  • Barcelona (Feb. 2013), Leiden (Apr. 2013), Ascona (May 2013)
• Financing (FP7): projects with > 10^6 € each
  • Stategra
  • MimOmics
• Try googling with the terms 'omics data integration'
But what is Data Integration?
◦ “Data integration” may mean different things...
  ◦ Computational combination of data
  ◦ Combination of studies performed independently
  ◦ Simultaneous analysis of multiple variables on multiple datasets.
  ◦ Not to mention any possible approach for homogeneously querying heterogeneous data sources
Integrative analysis may be preferable
There are many types of integrative analysis
Hamid et al. 2009
There are many methods ….
• Decision trees, Bayesian networks, Support vector machines, Graph algorithms, Multivariate analysis,
There are many issues to be addressed
• Data pre-processing
  – Data of same or different types
• High (but "cursed") dimensionality
  – N << p
  – Datasets of different sizes (10^4 genes, 10^3 proteins)
  – Multiple testing issues
• Missing values
  – Some values missing for some individuals
  – Non-rectangularity of the data
• Biological interpretation
So what?
• We will restrict ourselves to arbitrarily chosen examples that give an overview of the field without pretending to cover it all.
• Case studies.
  – Combining biological knowledge with omics data using multivariate statistics.
  – How to obtain improved cancer predictors by aggregating datasets.
  – Using network biology methods for translational cancer research.
Some examples
Integrative Analysis of the Relationship Between Insulin Resistance and Gut Microbiota
Insulin Resistance
Insulin resistance means cells become less sensitive to insulin.
This provokes the pancreas to over-compensate by working harder and releasing even more insulin.
Insulin resistance plus insulin over-production leads to two common outcomes: diabetes or obesity.
IS/IR and Gut Microbiota
The human gut microbiome is related to health and weight
  ◦ varies in healthy people
  ◦ varies in lean and obese
It is reasonable to postulate that insulin sensitivity is associated with changes in bacterial microflora.
Data for relating IR/IS with Microbiome
• Clinical variables (BMI, HOMA, Ins, HDL, …)
• Microarrays
  – Expression matrix and related annotations (GO)
• Microbial flora diversity based on
  – Denaturing Gradient Gel Electrophoresis
  – Metagenomic shotgun NGS sequencing

[Data layout: one row per sample (IS_NoD_10, IS_NoD_11, IS_NoD_12, IR_NoD_13, IR_NoD_14, IR_NoD_15, Diab_16, Diab_17) and five blocks of columns: Clin1 … ClinK1, DGGE1 … DGGEK2, Expr1 … ExprK3, GeneSet1 … GeneSetK4, Spec1 … SpecK5]
Principal Components Analysis
• Given a K×N data matrix containing K (correlated) measurements on N samples (objects/individuals…)
• PCA decomposes the data matrix into K new components that
  – account for different sources of variability in the data,
  – are uncorrelated, that is, each component accounts for a different source of variability,
  – have decreasing explanatory ability: each component explains more than the following one,
  – allow for a lower-dimensional representation of the data in terms of scores on principal components.
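The properties listed above can be illustrated with a minimal sketch. It assumes NumPy is available and that the data are stored samples-by-variables (N×K, the transpose of the K×N layout on the slide). The function name `pca_scores` and the random toy data are illustrative only and do not come from the talk.

```python
import numpy as np

def pca_scores(X):
    """PCA via singular value decomposition; X has shape (n_samples, n_variables)."""
    Xc = X - X.mean(axis=0)                      # centre each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                               # sample coordinates on the PCs
    explained = s**2 / np.sum(s**2)              # share of variance per component
    return scores, Vt, explained

# Toy data (random, for illustration only): 20 samples, 5 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=20)    # make two variables correlated

scores, loadings, explained = pca_scores(X)
cross = scores.T @ scores                        # off-diagonal terms vanish: PCs are uncorrelated
```

The rows of `loadings` are the principal directions, and `explained` is sorted in decreasing order because the SVD returns singular values that way.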
How does PCA work
• PCA provides a new set of coordinates for the observations
• Original coordinates
  • Value of the variables
• New coordinates
  • Value of PCs: scores
• Scores are the new coordinates in the orthogonal system defined by PCs.
[Figure axes: X1, X2]
Representing data in the PCA space
• PCs have been derived so that
  – They are orthogonal
  – Each PC explains the maximum amount of remaining variation in the data
• This means that it is not necessary to use all PCs to visualize the data in this new coordinate system
  – Taking the first PCs will often explain a high percentage of variability.
  – Usually only the first 2 or 3
  – This should always be checked!!!
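One way to do the check mentioned in the last bullet is to look at the cumulative share of variance explained. The sketch below assumes NumPy; the 80% threshold, the function name `components_needed` and the toy data are arbitrary illustrative choices, not values from the talk.

```python
import numpy as np

def components_needed(X, threshold=0.8):
    """Smallest number of PCs whose cumulative explained variance reaches threshold."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)          # singular values, largest first
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(s)), cumulative                # clamp against rounding at 1.0

# Toy data: three variables driven by one common factor plus small noise
rng = np.random.default_rng(1)
t = np.linspace(-1.0, 1.0, 30)
X = np.outer(t, [1.0, 2.0, 3.0]) + 0.01 * rng.normal(size=(30, 3))

k, cumulative = components_needed(X)
print(k, cumulative.round(4))
```

If `k` comes out larger than 2 or 3, a two-dimensional score plot hides a substantial part of the variability and should be read with care.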
Multiple Factor Analysis (MFA)
MFA is a multivariate statistical technique useful to analyze several groups of variables (numerical and/or categorical) defined on the same samples
Multiple Factor Analysis (2)
The core of MFA is a PCA applied to the whole set of variables.
Each group of variables is weighted, which makes it possible to analyze different points of view by taking them equally into account.
MFA makes it possible to look for common factors by providing a representation of each matrix of variables.
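A minimal sketch of the weighting idea, assuming NumPy and numerical groups only: each group is centred and divided by its first singular value, so that no group dominates the global PCA merely because it has more variables. Categorical groups would need indicator coding, which is not shown, and variables are only centred here, not scaled. The function name `mfa_scores` and the toy blocks are illustrative.

```python
import numpy as np

def mfa_scores(groups):
    """groups: list of (n_samples, n_vars) arrays measured on the same samples."""
    weighted = []
    for G in groups:
        Gc = G - G.mean(axis=0)                        # centre the group's variables
        s1 = np.linalg.svd(Gc, compute_uv=False)[0]    # first singular value of the group
        weighted.append(Gc / s1)                       # group's leading axis now has weight 1
    Z = np.hstack(weighted)                            # concatenate the weighted groups
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)   # global PCA
    return U * s, s

# Toy blocks (random): a large, high-variance block and a small one
rng = np.random.default_rng(2)
expression = 10.0 * rng.normal(size=(15, 50))
clinical = rng.normal(size=(15, 3))

scores, singular_values = mfa_scores([expression, clinical])
```

Without the weighting, the 50-variable block would drive the first global components almost entirely.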
MFA (3): Multiple displays
MFA (4): Supplementary info
The assets of MFA appear when integrating both numerical and categorical groups of variables, and when supplementary groups of data need to be added to the analysis.
Conclusions
The good
  ◦ MFA allows the integrated analysis of multiple groups of possibly heterogeneous data types.
  ◦ It can help to highlight associations previously undetected (“adds value”).
  ◦ It can deal with any number of groups and any type of supplementary variables (Gene Sets, Species, …)
Limitations
  ◦ It assumes individual-based information: no groups (e.g. pools) as input
  ◦ Missing values are difficult to deal with
Complementary idea 1: Improve use of biological knowledge
• The ultimate goal is a better understanding of (changes in) biological processes.
• It seems reasonable to make (increased) use of biological information.
• This can be done in different ways
  – Convert data into networks and align them
  – Project biological units into a common space and rely on
    • commonalities
    • differences
    for variable selection
Previous results
goProfiles
Variable selection based on Biological Knowledge
• Preliminary work on functional profiling can be used to project biological units such as genes or proteins into annotation databases such as the Gene Ontology
• An iterative algorithm can be used to select subsets that are either
  – most biologically diverse
  – most biologically homogeneous
• This can be used as a basis for variable selection prior to MFA
Integrative Omics Data Mining and Knowledge Discovery in Colorectal Cancer
based on a work by Jake Y. Chen, Ph.D., Indiana Center for Systems Biology & Personalized Medicine
Polyp and Colorectal Cancer
Polyp vs. Colorectal Cancer
• Polyps are benign tumors of the large intestine.
• They do not invade nearby tissue or spread to other parts of the body.
• If not removed from the large intestine, they may become malignant (cancerous) over time.
• Most cancers of the large intestine are believed to have developed from polyps.
Photo courtesy of National Cancer Institute
Colon Cancer vs. Rectal Cancer
• Share many commonalities, including molecular mechanisms.
• Tend to be treated differently.
Omics/Clinical Data Source
Proteomics/Metabolomics/Lipidomics/Clinical Data

Data set                 H    PP   CR   N
Diet                     70   54   29   153
Oxidative Stress         50   32   12   94
LC-MS Proteomics         80   72   40   192
Vitamin D                83   81   31   195
GC/GC MS Metabolomics    83   84   30   197
Lipidomics               47   35   15   97
NMR Metabolomics         53   35   15   103

(H = healthy, PP = polyp, CR = colorectal, N = total)
Scientific Questions to Answer
Data Analysis
• Which Omics data has the best prediction power?
• Which features in Omics data are important?
Data Mining
• Does integration of Omics data improve the prediction?
• Which combination of Omics data has the best prediction power?
Knowledge Discovery
• Why do those features in Omics data have the best prediction power?
Roadmap
• Knowledge Discovery of Proteomics Data
• Knowledge Discovery of Metabolomics Data
• Integrative Data Mining
Proteomics Data Description
Group: Bindley Biosciences Center at Purdue University
Instruments: Agilent's chip cube coupled to the XCT PLUS ESI ion trap
Data format at CCE webportal: mzXML
Number of Samples: Normal: 80; Polyp: 72; Colorectal: 40
LC-MS Proteomics Data Processing
LC/MS data “heat map”
Total Ion Chromatogram (TIC) summarized from enhanced heat map
Methods adapted from
N. Jeffries (2005) Bioinformatics, vol. 21 (no. 14), pp. 3066.
S.A. Kazmi et al. (2006) Metabolomics, vol. 2 (no. 2), pp. 75-83.
Image-enhanced LC/MS data “heat map”
LC-MS Major Protein Identification
~25-28 characteristic proteins/sample identified
Identify Most Informative TIC R.T. “Grid”
Apply the R.T. Grid to Original Spectra
Use Mascot to Search for Protein ID at R.T. Grid Regions

No   Scan   RT       Uniprot_ID    Score   Expect   Evidence
1    119    139.48   ADAD2_HUMAN   38      3.3      0
2    229    265.87   NNMT_HUMAN    43      1.1      2
3    372    429.15   ZSA5D_HUMAN   42      1.2      0
4    656    749.8    BRAF_HUMAN    40      2.2      479
5    1162   1276.6   RGS7_HUMAN    47      0.39     1
6    1310   1407.2   TTC9C_HUMAN   35      6.3      0
7    1669   1713.9   CP042_HUMAN   38      3.1      0
8    1866   1879.1   HXD11_HUMAN   34      8.4      0
9    1987   1980.3   ING4_HUMAN    38      3.1      2
10   2114   2086     ZN423_HUMAN   33      10       0
11   2353   2285.7   CL065_HUMAN   37      3.9      0
12   2539   2441.3   CA5BL_HUMAN   47      0.4      1
13   2722   2594.7   NPDC1_HUMAN   38      3.6      0
14   2874   2722.2   DJC27_HUMAN   37      3.8      0
15   3001   2828.5   BORG4_HUMAN   40      2.2      1
16   3165   2965.1   KC1G1_HUMAN   27      43       0
17   3440   3196.1   TPPC5_HUMAN   40      2        0
18   3656   3377.6   UB2D3_HUMAN   43      0.99     1
19   3997   3665.5   TM208_HUMAN   34      8.1      0
20   4257   3885.4   ZBED3_HUMAN   29      23       0
Proteomics Result Interpretation

Proteins Identified from Colon Cancer and Health Group

Uniprot_ID     Frequency in Colon (10)   Frequency in Health (10)   Evidence in PubMed
BRAF_HUMAN     3                         0                          508
DMP46_HUMAN    3                         0                          0
NNMT_HUMAN     3                         1                          4
MRP_HUMAN      1                         3                          0
STK33_HUMAN    0                         3                          0

Proteins Interacted with High-Frequency Proteins from Colon Cancer Group

Uniprot_ID      Gene     Protein Name                                           Evidence in PubMed
BRAF1_HUMAN     BRAF     Serine/threonine-protein kinase B-raf                  508
P53_HUMAN       TP53     Cellular tumor antigen p53                             443
CD44_HUMAN      CD44     CD44 antigen                                           411
MDM2_HUMAN      MDM2     E3 ubiquitin-protein ligase Mdm2                       131
BCR_HUMAN       BCR      Breakpoint cluster region protein                      59
LCK_HUMAN       LCK      Tyrosine-protein kinase Lck                            29
Q7RTZ3_HUMAN    LCK      Tyrosine-protein kinase Lck                            29
CAV1_HUMAN      CAV1     Caveolin-1                                             21
PNPH_HUMAN      PNP      Purine nucleoside phosphorylase                        13
CBL_HUMAN       CBL      E3 ubiquitin-protein ligase CBL                        11
RAF1_HUMAN      RAF1     RAF proto-oncogene serine/threonine-protein kinase     10
CD38_HUMAN      CD38     ADP-ribosyl cyclase 1                                  8
NNMT_HUMAN      NNMT     Nicotinamide N-methyltransferase                       4
IRAK1_HUMAN     IRAK1    Interleukin-1 receptor-associated kinase 1             3
DMPK_HUMAN      DMPK     Myotonin-protein kinase                                2
ITA5_HUMAN      ITGA5    Integrin alpha-5                                       1
ITB1_HUMAN      ITGB1    Integrin beta-1                                        1
ZAP70_HUMAN     ZAP70    Tyrosine-protein kinase ZAP-70                         1
Proteomics Result Interpretation: A Network Biology Context
Protein Network Constructed from the Top 3 Differential Proteins
Green-circled proteins are frequently (frequency >= 0.3) detected in the colon patient blood samples using LC/MS.
Node: protein with evidence from PubMed, found by searching ("GENE_SYMBOL" AND ("colon" OR "colorectal") AND ("cancer" OR "carcinoma")).
Edge: protein interaction with confidence score from HAPPI 1.31 (4- and 5-star).
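The two selection rules in this caption (detection frequency of at least 0.3, and a PubMed search built from the gene symbol) can be sketched as follows. The query template and the counts are taken from the slides; the function names are illustrative, and no call to any PubMed service is made here.

```python
def pubmed_query(gene_symbol):
    """Build the search string used on the slide for one gene symbol."""
    return (f'("{gene_symbol}" AND ("colon" OR "colorectal") '
            f'AND ("cancer" OR "carcinoma"))')

def frequently_detected(counts, n_samples, min_freq=0.3):
    """Proteins detected in at least min_freq of the samples."""
    return [p for p, c in counts.items() if c / n_samples >= min_freq]

# Detection counts in the 10 colon cancer samples, as listed in the table above
colon_counts = {"BRAF_HUMAN": 3, "DMP46_HUMAN": 3, "NNMT_HUMAN": 3,
                "MRP_HUMAN": 1, "STK33_HUMAN": 0}

frequent = frequently_detected(colon_counts, n_samples=10)
query = pubmed_query("BRAF")
print(frequent)
print(query)
```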
Proteomics Result Interpretation: A Biological Pathway Context
BRAF (Serine/threonine-protein kinase B-raf) plays major roles in the Colorectal Cancer Pathway (KEGG data)
NNMT (Nicotinamide N-methyltransferase) is involved in Biological Oxidations/Phase II Conjugation/Methylation (from Reactome)
Proteomics Result Interpretation: A Biological Pathway Context for NNMT
Roadmap
• Knowledge Discovery of Proteomics Data
• Knowledge Discovery of Metabolomics Data
  • NMR Data
  • GCxGC MS Data
• Integrative Data Mining
Metabolomics Data Description
Group: Daniel Raftery Laboratory at Purdue University
NMR Data
• Instruments: Bruker Avance 500 MHz NMR
• Data format at CCE webportal: Excel spreadsheet
• Number of Samples: Normal: 53; Polyp: 35; Colorectal: 15
GCxGC MS Data
• Instruments: LECO Pegasus 4D GCxGC-TOF
• Data format at CCE webportal: Excel spreadsheet
• Number of Samples: Normal: 83; Polyp: 84; Colorectal: 30
NMR Data Analysis Workflow
Extract peaks’ ppm
Search against Human Metabolome Database (2.5) to identify metabolites
Report only significant metabolites

Sample_ID   1                             2
Top1        Delta-Hexanolactone           Delta-Hexanolactone
Top2        Hypotaurine                   Hypotaurine
Top3        2,3-Diphosphoglyceric acid    Diethanolamine
Top4        Diethanolamine                3,7-Dimethyluric acid
Top5        3-Phosphoglyceric acid        Methyl isobutyl ketone
Top6        3,7-Dimethyluric acid         1,3,7-Trimethyluric acid
Top7        1,3,7-Trimethyluric acid      Cysteine-S-sulfate
Top8        L-Allothreonine               L-Allothreonine
Top9
Top10
Signal Processing
NMR Peak Metabolite Identification using Human Metabolome Database
1) Input the peak lists
2) Get the metabolites; leave out those with fewer than 2 matches
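The two steps above can be sketched as a simple peak-matching routine. Everything in the example is a placeholder: the reference entries `metabolite_A` and `metabolite_B`, their ppm values and the 0.02 ppm tolerance are made up for illustration and are not taken from the Human Metabolome Database or from the talk. Only the rule "fewer than 2 matches are left out" comes from the slide.

```python
def identify_metabolites(peaks_ppm, reference, tol=0.02, min_matches=2):
    """Count reference peaks matched by the sample and keep well-supported metabolites.

    peaks_ppm : list of peak positions (ppm) extracted from one spectrum
    reference : dict {metabolite name: list of reference peak positions (ppm)}
    """
    hits = {}
    for name, ref_peaks in reference.items():
        # A reference peak counts as matched if any sample peak lies within tol
        n = sum(any(abs(p - r) <= tol for p in peaks_ppm) for r in ref_peaks)
        if n >= min_matches:          # leave out those with fewer than 2 matches
            hits[name] = n
    return hits

# Placeholder reference library and peak list (not real chemical shifts)
reference = {"metabolite_A": [1.20, 3.50, 4.10],
             "metabolite_B": [2.00, 7.30]}
peaks = [1.21, 3.49, 5.00, 7.31]

hits = identify_metabolites(peaks, reference)
print(hits)
```

Here `metabolite_B` is dropped because only one of its reference peaks is matched.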
Significant Metabolites Identified from NMR Metabolomics Data

Group                  Metabolites
Polyp vs Health        D-Arabitol, D-Pantethine (2/35 vs 0/53)
Colorectal vs Polyp    None
Colorectal vs Health   D-Arabitol (2/15 vs 0/53)

Population Frequency =
Marker metabolites? Shared metabolites
D-Arabitol Identified from NMR Results
Involved in Pentose and Glucuronate Interconversions Pathways
Roadmap
• Knowledge Discovery of Proteomics Data
• Knowledge Discovery of Metabolomics Data
  • NMR Data
  • GCxGC MS Data
• Integrative Data Mining
Results from GCxGC MS Data I
Metabolite identification is more straightforward
Metabolites by comparison (columns: Polyp vs Healthy | Colorectal vs Polyp | Colorectal vs Healthy)
Methanesulfinic acid, trimethylsilyl ester Acetic acid, (methoxyimino)-, trimethylsilyl ester Butanoic acid, 2-[(trimethylsilyl)oxy]-, trimethylsilyl ester
Propanoic acid, 2-(methoxyimino)-, trimethylsilyl ester
Pentanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester
L-Valine, N-(trimethylsilyl)-, trimethylsilyl ester
Hexanedioic acid, bis(2-ethylhexyl) ester Methanesulfinic acid, trimethylsilyl ester Cholesterol trimethylsilyl ether
Mefloquine Pentanedioic acid, 2-(methoxyimino)-, bis(trimethylsilyl) ester
Hexanoic acid, trimethylsilyl ester
Cyclohexane, 1,3,5-trimethyl-2-octadecyl- L-Valine, N-(trimethylsilyl)-, trimethylsilyl ester Pentanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester
Tetradecanoic acid, trimethylsilyl ester Butanoic acid, 2-[(trimethylsilyl)oxy]-, trimethylsilyl ester
Hexanoic acid, 2-(methoxyimino)-, trimethylsilyl ester
psi,psi.-Carotene, 3,3',4,4'-tetradehydro-1,1',2,2'-tetrahydro-1,1'-dimethoxy-2,2'-dioxo-
Cyclohexane, 1,3,5-trimethyl-2-octadecyl- 3,6-Dioxa-2,7-disilaoctane, 2,2,4,7,7-pentamethyl-
Silanol, trimethyl-, pyrophosphate (4:1) Butanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester
Trimethylsilyl ether of glycerol L-Asparagine, N,N2-bis(trimethylsilyl)-, trimethylsilyl ester
Ethylbis(trimethylsilyl)amine
Cyclotrisiloxane, 2,4,6-trimethyl-2,4,6-triphenyl-
Benzene, (1-hexadecylheptadecyl)-
Pentanedioic acid, 2-(methoxyimino)-, bis(trimethylsilyl) ester
Results from GCxGC MS Data II
A. Polyp vs Healthy B. Polyp vs Colorectal C. Colorectal vs Healthy
Comparative Results (Intensity vs. Population)
Marker Metabolite Panel: Clustering of three groups
Intensity-based heat map
Population-frequency-based heat map
Metabolites identified from GCxGC MS Results
Involved in Fatty Acid Biosynthesis Pathways
Roadmap
• Knowledge Discovery of Proteomics Data
• Knowledge Discovery of Metabolomics Data
• Integrative Data Mining
Data Set Description
Diet, Lipidomics, Oxidative and VD
• # of features and the total # of subjects vary
• Three classes are balanced to the least common denominator
  • Healthy vs. Polyp
  • Healthy vs. Colorectal
  • Polyp vs. Colorectal

                 Diet   Lipid   Oxidative   VD
Total Subjects   150    97      94          195
Total Features   38     49      3           2
Predictive Modeling Methods
Data Preprocessing
• Filtering outliers (three standard deviations away from the mean)
• Data normalization (transforming to the 0-1 range)
• Binned categorical data using the quantile binning method
Missing Value Treatment
• Replaced with the mean value of the attribute in the group
Support vector machine (SVM) classifier kernel
• Radial Basis Function (RBF) kernel is used
Feature Selection Methods
• Approach #1: Two-sample unpaired t-tests at the 5% significance level.
• Approach #2: SVM Attribute Evaluator with Ranker Algorithm. Features from t-tests are filtered using p-values
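The preprocessing and missing-value steps can be sketched for a single attribute using only the Python standard library. The order of the steps, the function names and the toy values are assumptions made for illustration; the SVM and the t-test selection are not shown because they would need a machine-learning library.

```python
from statistics import mean, stdev

def impute_group_mean(values, groups):
    """Replace None with the mean of the observed values in the same group.
    Assumes every group has at least one observed value."""
    observed = {}
    for v, g in zip(values, groups):
        if v is not None:
            observed.setdefault(g, []).append(v)
    means = {g: mean(vs) for g, vs in observed.items()}
    return [means[g] if v is None else v for v, g in zip(values, groups)]

def keep_within_sd(values, n_sd=3.0):
    """Indices of values lying within n_sd standard deviations of the mean."""
    m, s = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if s == 0 or abs(v - m) <= n_sd * s]

def min_max(values):
    """Rescale values to the 0-1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Toy attribute with two missing values (H = healthy, P = polyp)
raw = [1.0, None, 3.0, 2.0, 10.0, None]
groups = ["H", "H", "H", "P", "P", "P"]

filled = impute_group_mean(raw, groups)
scaled = min_max(filled)
kept = keep_within_sd([0.0] * 20 + [100.0])   # the single extreme value is dropped
```

With very few samples no value can lie three standard deviations from the mean, which is why the outlier filter is demonstrated on a longer toy vector.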
[Workflow figure labels: Raw Dataset, Clean Dataset, Hypothesis, Classification Model, K-fold Cross-validation]
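The cross-validation step of this workflow can be sketched with the standard library alone. Stratifying the folds by class is an assumption added here (the slide only says "K-fold"), and `fit_and_score` is a hypothetical callback standing in for training and scoring a classifier such as the SVM.

```python
import random

def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds, spreading each class across the folds."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    return folds

def cross_validate(labels, k, fit_and_score):
    """Average the score returned by fit_and_score(train_idx, test_idx) over k folds."""
    scores = []
    for fold in stratified_folds(labels, k):
        held_out = set(fold)
        train = [i for i in range(len(labels)) if i not in held_out]
        scores.append(fit_and_score(train, sorted(held_out)))
    return sum(scores) / k

# Toy labels: 6 healthy and 6 polyp subjects, 3 folds
labels = ["H"] * 6 + ["P"] * 6
folds = stratified_folds(labels, k=3)

# Placeholder scorer: returns the held-out fraction instead of a real accuracy
mean_score = cross_validate(labels, 3, lambda train, test: len(test) / len(labels))
```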
Dietary Attributes as Predictors
Polyp vs. Healthy: SVM predictor accuracy = 64%
Colorectal vs. Healthy: SVM predictor accuracy = 65%
[Figure: p-values per dietary attribute for each comparison. Attributes shown: Ice cream, Rice, Tea, Shellfish, Salad, Tomato, Egg, Milk. P-values shown: 2.38E-02, 4.21E-01, 4.11E-02, 1.21E-01, 2.53E-02, 9.57E-01, 3.71E-02, 5.60E-02.]
Lipidomics T-Test Results
Significant features selected from t-tests, with their corresponding p-values
Features Polyp vs. Healthy Polyp vs. Colorectal Colorectal vs. Healthy
16:0/18:1 PE 1.76E-02
24:1 Cer 6.90E-03
LPE 18:1 <1.00E-04
LPE 20:0 1.50E-03 2.00E-04
An-16:0 LPA 3.23E-02
An-18:1 LPA 3.38E-02 1.33E-02
AA 1.13E-02
18:2 LPA 1.13E-02 4.50E-03
20:4 LPA 2.40E-02
22:6 FA 4.28E-02 3.24E-02
LPE 16:0 3.08E-02 3.40E-03
LPE 18:0 3.90E-03 1.00E-04
LPE 18:1 2.18E-02
Integrating lipidomics with clinical features
Performance comparisons

Comparison                 Accuracy                  Accuracy                      Accuracy
                           (without pre-selection)   (with t-test pre-selection)   (automatic selection)
Polyp vs. Healthy          0.54                      0.71                          0.78
Colorectal vs. Healthy*    0.57                      0.63                          0.73
Polyp vs. Colorectal*      0.70                      0.90                          0.87

* Since the number of subjects was less than 15, 3-fold cross-validation accuracy was reported.

Comparison                 Accuracy
Polyp vs. Healthy          0.55
Colorectal vs. Healthy*    0.60
Polyp vs. Colorectal*      0.60

Panel labels: Without Clinical Features; With Clinical Features
Messages
• Individual Omics data sets have variable predictive performance
• Thorough statistical filtering plus biological knowledge integration is needed to battle the inherently high level of data noise
• Integration of different Omics data with clinical data can improve predictive performance
Network methods and data integration
Network methods
• [Obvious comment]: Networks are everywhere, from social networks such as Facebook to terrorist threats or biochemical processes
• Network science is a (re-)emerging field that relies on different approaches to modeling systems of interacting elements in order to describe, model and predict the behavior of diverse systems.
Biological systems modelling
Building and using networks
– Networks can be created by collecting interactions published in papers, or can be reconstructed directly from data.
– Different types of biological intracellular molecular networks can be represented by different types of graphs.
– Protein interaction networks and cell signaling networks can be connected to drugs and diseases
– Network representation can be used to integrate different datasets using genes as anchors
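The idea of using genes as anchors can be sketched with plain dictionaries: each gene becomes a node that carries whatever measurements exist for it, and an edge list supplies the interactions. All values and the single edge below are hypothetical placeholders, and the function names are illustrative.

```python
def integrate_by_gene(**datasets):
    """Merge several {gene: value} mappings into one attribute record per gene."""
    nodes = {}
    for layer, data in datasets.items():
        for gene, value in data.items():
            nodes.setdefault(gene, {})[layer] = value
    return nodes

def build_adjacency(nodes, edges):
    """Undirected adjacency lists restricted to genes present as nodes."""
    adjacency = {gene: set() for gene in nodes}
    for a, b in edges:
        if a in nodes and b in nodes:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

# Hypothetical measurements keyed by gene symbol
expression = {"BRAF": 2.1, "TP53": -0.4}
proteomics = {"BRAF": 1.3, "NNMT": 0.8}
interactions = [("BRAF", "TP53")]        # placeholder edge, not a curated interaction

nodes = integrate_by_gene(expression=expression, proteomics=proteomics)
adjacency = build_adjacency(nodes, interactions)
in_all_layers = [g for g, attrs in nodes.items() if len(attrs) == 2]
```

Genes measured on only one platform stay in the network with a partial record, which is one way to cope with the non-rectangular data mentioned earlier.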
Network biology methods integrating biological data for translational science
Bebek G et al. Brief Bioinform 2012;13:446-459
An integrative -omics signaling network identification process workflow
• Start with processing tissue-specific data (instrument outputs).
• Microarray data is normalized to make comparisons of expression levels and transformed to select genes for further analysis.
• Genome-wide genotyping signals are analyzed to identify regions (and hence regional genes) for both tumor and normal tissue (or non-cancerous cells).
• Next, genomic regions with significant aberrations are merged with their corresponding microarray probes to create expression profiles.
• In this analysis step, expression profiles are used to calculate Pearson's coexpression correlations among gene pairs.
• These results are fed into the Pathway Analysis Framework.
• Integrating gene–gene coexpression values, annotations from GO, known signaling pathways, protein sequence information, PPI networks and protein subcellular co-localization data, pathways are predicted and filtered.
• Significant pathway subnetworks are merged to form signaling networks connecting genes of interest.
• The networks and genomic alterations identified are put together to create a descriptive functional network, creating a molecular basis for the cancer studied.
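The coexpression step of this workflow (Pearson correlation among gene pairs) can be sketched as follows. The expression profiles, the gene names and the 0.8 cut-off are illustrative assumptions; the workflow described above does not state a threshold.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def coexpression_edges(profiles, min_abs_r=0.8):
    """Gene pairs whose absolute correlation reaches the cut-off."""
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if abs(r) >= min_abs_r:
            edges.append((g1, g2, r))
    return edges

# Toy expression profiles over four samples (made-up values)
profiles = {"geneA": [1.0, 2.0, 3.0, 4.0],
            "geneB": [2.0, 4.0, 6.0, 8.5],
            "geneC": [4.0, 1.0, 3.0, 2.0]}

edges = coexpression_edges(profiles)
```

The resulting edge list is the kind of gene-gene coexpression input that would then be combined with GO annotations, known pathways and PPI networks.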
Network-based prioritization of candidate disease genes.
Bebek G et al. Brief Bioinform 2012;bib.bbr075
Conclusions
• Data integration or, better, integrative analysis of 'omics data' is a challenging topic with many open problems.
• Current state: go study by study and consider the nature of the data and the type of question.
• Current approaches are diverse: machine learning, dimension reduction, pathway visualization
• Diverse open research lines, lots of room for improvement
• Yet to come:
  – the "integrator": automatic combination that clearly improves biological interpretation.
  – Mathematical framework common to all problems
• Last but not least: integrative analysis requires integrative work, well inside the philosophy of Biostatnet or other collaborative networks.
Acknowledgments
Statistics and Bioinformatics Research Group at the Statistics Department of the University of Barcelona.
The Biostatnet group and particularly Carmen Cadarso and Lupe Gomez
My colleagues at the Statistics and Bioinformatics Unit at the Vall d'Hebrón Research Institute
Unitat de Serveis Científico Tècnics (UCTS) at the Vall d'Hebrón Research Institute
Thank you for your attention!