Integrative Colorectal Cancer Omics Data Mining and
Knowledge Discovery
Jake Y. Chen, Ph.D.
IUPUI
Indiana Center for Systems Biology & Personalized Medicine
http://bio.informatics.iupui.edu
Polyp and Colorectal Cancer
Polyp vs. Colorectal Cancer
• Polyps are benign tumors of the large intestine.
• They do not invade nearby tissue or spread to other parts of the body.
• If not removed from the large intestine, they may become malignant (cancerous) over time.
• Most cancers of the large intestine are believed to have developed from polyps.
Photo courtesy of National Cancer Institute
Colon Cancer vs. Rectal Cancer
• Share many commonalities, including molecular mechanisms.
• Tend to be treated differently.
Colorectal Cancer Molecular Pathways
A. Walther, et al. (2009) Nature Reviews Cancer, 9(7) pp. 489-99
Omics/Clinical Data Source
Proteomics/Metabolomics/Lipidomics/Clinical Data
(H = healthy, PP = polyp, CR = colorectal, N = total subjects)
Data set               H   PP  CR  N
Diet                   70  54  29  153
Oxidative Stress       50  32  12  94
LC-MS Proteomics       80  72  40  192
Vitamin D              83  81  31  195
GCxGC MS Metabolomics  83  84  30  197
Lipidomics             47  35  15  97
NMR Metabolomics       53  35  15  103
Scientific Questions to Answer
Data Analysis
• Which Omics data set has the best predictive power?
• Which features in the Omics data are important?
Data Mining
• Does integration of Omics data improve prediction?
• Which combination of Omics data sets has the best predictive power?
Knowledge Discovery
• Why do those features in the Omics data have the best predictive power?
Roadmap
Knowledge Discovery of Proteomics Data
Knowledge Discovery of Metabolomics Data
Integrative Data Mining
Proteomics Data Description
Group: Bindley Bioscience Center at Purdue University
Instruments: Agilent's Chip Cube coupled to the XCT Plus ESI ion trap
Data format at CCE web portal: mzXML
Number of Samples: Normal: 80; Polyp: 72; Colorectal: 40
LC-MS Proteomics Data Processing
LC/MS data “heat map”
Total Ion Chromatogram (TIC) summarized from enhanced heat map
Methods adapted from:
N. Jeffries (2005) Bioinformatics, vol. 21, no. 14, p. 3066
S.A. Kazmi, et al. (2006) Metabolomics, vol. 2, no. 2, pp. 75-83
Image Enhanced LC/MS data “heat map”
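A minimal sketch of how a Total Ion Chromatogram can be summarized from an LC/MS heat map, assuming the heat map is a matrix with one row per retention-time scan and one column per m/z bin. The function name and the toy values are illustrative assumptions, not the processing code used for these data.

```python
from typing import List


def total_ion_chromatogram(heat_map: List[List[float]]) -> List[float]:
    """Sum the intensities over all m/z bins for each retention-time scan."""
    return [sum(scan) for scan in heat_map]


# Hypothetical toy heat map: 3 scans (rows) x 4 m/z bins (columns).
toy_heat_map = [
    [0.0, 5.0, 1.0, 0.0],
    [2.0, 8.0, 3.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
]

tic = total_ion_chromatogram(toy_heat_map)
print(tic)
```

Each TIC value is the total signal of one scan, so the heat map's m/z axis is collapsed and only the retention-time axis remains.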
LC-MS Major Protein Identification
~25-28 characteristic proteins identified per sample
Identify Most Informative TIC R.T. “Grid”
Apply the R.T. Grid to Original Spectra
Use Mascot to Search for Protein ID at R.T. Grid Regions
No Scan RT Uniprot_ID Score Expect Evidence
1 119 139.48 ADAD2_HUMAN 38 3.3 0
2 229 265.87 NNMT_HUMAN 43 1.1 2
3 372 429.15 ZSA5D_HUMAN 42 1.2 0
4 656 749.8 BRAF_HUMAN 40 2.2 479
5 1162 1276.6 RGS7_HUMAN 47 0.39 1
6 1310 1407.2 TTC9C_HUMAN 35 6.3 0
7 1669 1713.9 CP042_HUMAN 38 3.1 0
8 1866 1879.1 HXD11_HUMAN 34 8.4 0
9 1987 1980.3 ING4_HUMAN 38 3.1 2
10 2114 2086 ZN423_HUMAN 33 10 0
11 2353 2285.7 CL065_HUMAN 37 3.9 0
12 2539 2441.3 CA5BL_HUMAN 47 0.4 1
13 2722 2594.7 NPDC1_HUMAN 38 3.6 0
14 2874 2722.2 DJC27_HUMAN 37 3.8 0
15 3001 2828.5 BORG4_HUMAN 40 2.2 1
16 3165 2965.1 KC1G1_HUMAN 27 43 0
17 3440 3196.1 TPPC5_HUMAN 40 2 0
18 3656 3377.6 UB2D3_HUMAN 43 0.99 1
19 3997 3665.5 TM208_HUMAN 34 8.1 0
20 4257 3885.4 ZBED3_HUMAN 29 23 0
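A small sketch of how rows like those above could be parsed and ranked by PubMed evidence count. Only the first five rows are copied from the table; the `Hit` type and both function names are hypothetical, and ranking by evidence is one possible way to read the table rather than a step stated in the slides.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    scan: int
    rt: float
    uniprot_id: str
    score: int
    expect: float
    evidence: int


# First five rows of the table above (No, Scan, RT, Uniprot_ID, Score, Expect, Evidence).
TABLE_TEXT = """\
1 119 139.48 ADAD2_HUMAN 38 3.3 0
2 229 265.87 NNMT_HUMAN 43 1.1 2
3 372 429.15 ZSA5D_HUMAN 42 1.2 0
4 656 749.8 BRAF_HUMAN 40 2.2 479
5 1162 1276.6 RGS7_HUMAN 47 0.39 1
"""


def parse_hits(text: str) -> List[Hit]:
    """Turn whitespace-separated table rows into Hit records."""
    hits = []
    for line in text.splitlines():
        if not line.strip():
            continue
        _no, scan, rt, uid, score, expect, evidence = line.split()
        hits.append(Hit(int(scan), float(rt), uid, int(score),
                        float(expect), int(evidence)))
    return hits


def with_literature_support(hits: List[Hit]) -> List[Hit]:
    """Keep hits with at least one PubMed record, most evidence first."""
    supported = [h for h in hits if h.evidence > 0]
    return sorted(supported, key=lambda h: h.evidence, reverse=True)


ranked_ids = [h.uniprot_id for h in with_literature_support(parse_hits(TABLE_TEXT))]
print(ranked_ids)
```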
Proteomics Result Interpretation
Proteins Identified from Colon Cancer and Healthy Groups
Uniprot_ID    Frequency in Colon (10)  Frequency in Health (10)  Evidence in PubMed
BRAF_HUMAN    3                        0                         508
DMP46_HUMAN   3                        0                         0
NNMT_HUMAN    3                        1                         4
MRP_HUMAN     1                        3                         0
STK33_HUMAN   0                        3                         0
Proteins Interacted with High-Frequency Proteins from Colon Cancer Group
Uniprot_ID     Gene    Protein Name                                          Evidence in PubMed
BRAF1_HUMAN    BRAF    Serine/threonine-protein kinase B-raf                 508
P53_HUMAN      TP53    Cellular tumor antigen p53                            443
CD44_HUMAN     CD44    CD44 antigen                                          411
MDM2_HUMAN     MDM2    E3 ubiquitin-protein ligase Mdm2                      131
BCR_HUMAN      BCR     Breakpoint cluster region protein                     59
LCK_HUMAN      LCK     Tyrosine-protein kinase Lck                           29
Q7RTZ3_HUMAN   LCK     Tyrosine-protein kinase Lck                           29
CAV1_HUMAN     CAV1    Caveolin-1                                            21
PNPH_HUMAN     PNP     Purine nucleoside phosphorylase                       13
CBL_HUMAN      CBL     E3 ubiquitin-protein ligase CBL                       11
RAF1_HUMAN     RAF1    RAF proto-oncogene serine/threonine-protein kinase    10
CD38_HUMAN     CD38    ADP-ribosyl cyclase 1                                 8
NNMT_HUMAN     NNMT    Nicotinamide N-methyltransferase                      4
IRAK1_HUMAN    IRAK1   Interleukin-1 receptor-associated kinase 1            3
DMPK_HUMAN     DMPK    Myotonin-protein kinase                               2
ITA5_HUMAN     ITGA5   Integrin alpha-5                                      1
ITB1_HUMAN     ITGB1   Integrin beta-1                                       1
ZAP70_HUMAN    ZAP70   Tyrosine-protein kinase ZAP-70                        1
Proteomics Result Interpretation
A Network Biology Context
Protein Network Constructed from the Top 3 Differential Proteins
Green-circled proteins are frequently detected (frequency >= 0.3) in colon cancer patient blood samples by LC/MS.
Node: protein with PubMed evidence, found by searching ("GENE_SYMBOL" AND ("colon" OR "colorectal") AND ("cancer" OR "carcinoma")).
Edge: protein interaction with a confidence score from HAPPI 1.31 (4- and 5-star).
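The two rules in the caption above, the 0.3 frequency cutoff and the PubMed search string, can be sketched as follows. The counts come from the colon group column of the frequency table earlier in the deck; reading "(10)" as 10 samples per group is an assumption, and both function names are hypothetical.

```python
from typing import Dict, List


def detection_frequency(times_detected: int, n_samples: int) -> float:
    """Fraction of samples in a group in which the protein was identified."""
    return times_detected / n_samples


def pubmed_query(gene_symbol: str) -> str:
    """Build the search string quoted in the caption for one gene symbol."""
    return (f'("{gene_symbol}" AND ("colon" OR "colorectal") '
            f'AND ("cancer" OR "carcinoma"))')


# Detection counts in the colon group; 10 samples per group is assumed.
colon_counts: Dict[str, int] = {
    "BRAF_HUMAN": 3,
    "DMP46_HUMAN": 3,
    "NNMT_HUMAN": 3,
    "MRP_HUMAN": 1,
    "STK33_HUMAN": 0,
}

frequent: List[str] = [
    protein for protein, count in colon_counts.items()
    if detection_frequency(count, 10) >= 0.3
]
print(frequent)
print(pubmed_query("BRAF"))
```

Submitting the query to PubMed and counting the returned records is left out here, since the slides do not say how the search was run.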
Proteomics Result Interpretation
A Biological Pathway Context
BRAF (Serine/threonine-protein kinase B-raf) plays a major role in the Colorectal Cancer Pathway (KEGG data)
NNMT (Nicotinamide N-methyltransferase) is involved in Biological Oxidations/Phase II Conjugation/Methylation (from Reactome)
Proteomics Result Interpretation
A Biological Pathway Context for NNMT
Roadmap
Knowledge Discovery of Proteomics Data
Knowledge Discovery of Metabolomics Data
  NMR Data
  GCxGC MS Data
Integrative Data Mining
Metabolomics Data Description
Group: Daniel Raftery Laboratory at Purdue University
1. NMR Data
Instruments: Bruker Avance 500 MHz NMR
Data format at CCE web portal: Excel spreadsheet
Number of Samples: Normal: 53; Polyp: 35; Colorectal: 15
2. GCxGC MS Data
Instruments: LECO Pegasus 4D GCxGC-TOF
Data format at CCE web portal: Excel spreadsheet
Number of Samples: Normal: 83; Polyp: 84; Colorectal: 30
NMR Data Analysis Workflow
• Signal processing
• Extract peaks’ ppm
• Search against the Human Metabolome Database (2.5) to identify metabolites
• Report only significant metabolites
Sample_ID  1                           2
Top1       Delta-Hexanolactone         Delta-Hexanolactone
Top2       Hypotaurine                 Hypotaurine
Top3       2,3-Diphosphoglyceric acid  Diethanolamine
Top4       Diethanolamine              3,7-Dimethyluric acid
Top5       3-Phosphoglyceric acid      Methyl isobutyl ketone
Top6       3,7-Dimethyluric acid       1,3,7-Trimethyluric acid
Top7       1,3,7-Trimethyluric acid    Cysteine-S-sulfate
Top8       L-Allothreonine             L-Allothreonine
Top9
Top10
NMR Peak Metabolite Identification Using the Human Metabolome Database
1) Input the peak lists
2) Get the metabolites; leave out those with fewer than 2 matches
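The two steps above can be sketched as a simple peak-matching routine: each reference metabolite is counted once for every reference shift that has an observed peak within a tolerance, and metabolites with fewer than 2 matches are left out. The reference shifts, the metabolite names, and the 0.02 ppm tolerance are hypothetical placeholders, not values from the Human Metabolome Database or its search tool.

```python
from typing import Dict, List


def match_metabolites(
    peaks_ppm: List[float],
    reference: Dict[str, List[float]],
    tolerance: float = 0.02,   # assumed matching window in ppm
    min_matches: int = 2,      # the slide drops metabolites with fewer than 2 matches
) -> Dict[str, int]:
    """Return metabolite -> number of reference shifts matched by an observed peak."""
    results = {}
    for name, ref_shifts in reference.items():
        n_matched = sum(
            1 for ref in ref_shifts
            if any(abs(ref - peak) <= tolerance for peak in peaks_ppm)
        )
        if n_matched >= min_matches:
            results[name] = n_matched
    return results


# Hypothetical reference library and sample peak list (illustration only).
reference_library = {
    "metabolite_A": [1.20, 3.45, 4.10],
    "metabolite_B": [2.05, 7.30],
    "metabolite_C": [5.50],
}
sample_peaks = [1.21, 3.44, 2.06, 6.00]

identified = match_metabolites(sample_peaks, reference_library)
print(identified)
```

Here metabolite_B is dropped because only one of its reference shifts is matched, which is the filtering described in step 2.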
Significant Metabolites Identified from NMR Metabolomics Data
Group                  Metabolites
Polyp vs Health        D-Arabitol, D-Pantethine (2/35 vs 0/53)
Colorectal vs Polyp    None
Colorectal vs Health   D-Arabitol (2/15 vs 0/53)
Population Frequency =
Marker metabolites? Shared metabolites
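The fractions reported with the metabolites above (for example 2/35 vs 0/53) suggest that population frequency is the share of samples in a group in which a metabolite is detected. The sketch below assumes that reading; the function name is hypothetical.

```python
def population_frequency(n_detected: int, group_size: int) -> float:
    """Share of samples in a group in which the metabolite was detected."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return n_detected / group_size


# D-Arabitol counts as reported in the table above.
polyp_freq = population_frequency(2, 35)
healthy_freq = population_frequency(0, 53)
colorectal_freq = population_frequency(2, 15)
print(polyp_freq, healthy_freq, colorectal_freq)
```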
D-Arabitol Identified from NMR Results
Involved in Pentose and Glucuronate Interconversions Pathways
Roadmap
Knowledge Discovery of Proteomics Data
Knowledge Discovery of Metabolomics Data
  NMR Data
  GCxGC MS Data
Integrative Data Mining
Results from GCxGC MS Data I
Metabolite identification is more straightforward
Polyp vs Healthy    Colorectal vs Polyp    Colorectal vs Healthy
Metabolites         Metabolites            Metabolites
Methanesulfinic acid, trimethylsilyl ester Acetic acid, (methoxyimino)-, trimethylsilyl ester Butanoic acid, 2-[(trimethylsilyl)oxy]-, trimethylsilyl ester
Propanoic acid, 2-(methoxyimino)-, trimethylsilyl ester
Pentanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester
L-Valine, N-(trimethylsilyl)-, trimethylsilyl ester
Hexanedioic acid, bis(2-ethylhexyl) ester Methanesulfinic acid, trimethylsilyl ester Cholesterol trimethylsilyl ether
Mefloquine Pentanedioic acid, 2-(methoxyimino)-, bis(trimethylsilyl) ester
Hexanoic acid, trimethylsilyl ester
Cyclohexane, 1,3,5-trimethyl-2-octadecyl- L-Valine, N-(trimethylsilyl)-, trimethylsilyl ester Pentanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester
Tetradecanoic acid, trimethylsilyl ester Butanoic acid, 2-[(trimethylsilyl)oxy]-, trimethylsilyl ester
Hexanoic acid, 2-(methoxyimino)-, trimethylsilyl ester
psi,psi.-Carotene, 3,3',4,4'-tetradehydro-1,1',2,2'-tetrahydro-1,1'-dimethoxy-2,2'-dioxo-
Cyclohexane, 1,3,5-trimethyl-2-octadecyl- 3,6-Dioxa-2,7-disilaoctane, 2,2,4,7,7-pentamethyl-
Silanol, trimethyl-, pyrophosphate (4:1) Butanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester
Trimethylsilyl ether of glycerol L-Asparagine, N,N2-bis(trimethylsilyl)-, trimethylsilyl ester
Ethylbis(trimethylsilyl)amine
Cyclotrisiloxane, 2,4,6-trimethyl-2,4,6-triphenyl-
Benzene, (1-hexadecylheptadecyl)-
Pentanedioic acid, 2-(methoxyimino)-, bis(trimethylsilyl) ester
Results from GCxGC MS Data II
A. Polyp vs. Healthy
B. Polyp vs. Colorectal
C. Colorectal vs. Healthy
Comparative Results (Intensity vs. Population)
Marker Metabolite Panel Clustering of Three Groups
Intensity-based heat map
Population frequency-based heat map
Metabolites Identified from GCxGC MS Results
Involved in Fatty Acid Biosynthesis Pathways
Roadmap
Knowledge Discovery of Proteomics Data
Knowledge Discovery of Metabolomics Data
Integrative Data Mining
Data Set Description
Data sets: Diet, Lipidomics, Oxidative Stress, and Vitamin D (VD)
The number of features and the total number of subjects vary across data sets
Three classes are balanced to the least common denominator
Pairwise comparisons: Healthy vs. Polyp; Healthy vs. Colorectal; Polyp vs. Colorectal
                Diet  Lipid  Oxidative  VD
Total Subjects  150   97     94         195
Total Features  38    49     3          2
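One possible reading of "balanced to the least common denominator" is random undersampling of each pair of groups down to the size of the smaller group. The sketch below assumes that reading; the function name, the seed, and the subject IDs are hypothetical, and the group sizes are the lipidomics healthy (47) and polyp (35) counts from the data source slide.

```python
import random
from typing import List, Tuple


def balance_pair(group_a: List[str], group_b: List[str],
                 seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly undersample both groups to the size of the smaller one."""
    n = min(len(group_a), len(group_b))
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(group_a, n), rng.sample(group_b, n)


# Hypothetical subject IDs with the lipidomics group sizes.
healthy_ids = [f"H{i}" for i in range(47)]
polyp_ids = [f"PP{i}" for i in range(35)]

balanced_healthy, balanced_polyp = balance_pair(healthy_ids, polyp_ids)
print(len(balanced_healthy), len(balanced_polyp))
```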
Predictive Modeling Methods
Data Preprocessing
• Outliers filtered (more than three standard deviations away from the mean)
• Data normalized (transformed to the 0-1 range)
• Categorical data binned using the quantile binning method
Missing Value Treatment
• Replaced with the mean value of the attribute within the group
Support Vector Machine (SVM) Classifier
• Radial basis function (RBF) kernel is used
Feature Selection Methods
• Approach #1: Two-sample unpaired t-tests at the 5% significance level; features from the t-tests are filtered using p-values
• Approach #2: SVM Attribute Evaluator with Ranker algorithm
K-fold Cross-validation
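The preprocessing and missing-value steps listed above can be sketched for a single numeric attribute as follows. This is an illustration under assumptions: the function names, the use of the population standard deviation, and the equal-count binning rule are choices made here, not details given in the slides.

```python
import statistics
from typing import List, Optional


def impute_with_group_mean(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the mean of the observed values in the group."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]


def filter_outliers(values: List[float], n_sd: float = 3.0) -> List[float]:
    """Drop values more than n_sd standard deviations away from the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD; an assumption
    if sd == 0:
        return list(values)
    return [v for v in values if abs(v - mean) <= n_sd * sd]


def min_max_normalize(values: List[float]) -> List[float]:
    """Rescale values linearly to the 0-1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def quantile_bin(values: List[float], n_bins: int = 4) -> List[int]:
    """Assign each value to one of n_bins bins holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = min(rank * n_bins // len(values), n_bins - 1)
    return bins


imputed = impute_with_group_mean([2.0, None, 4.0])
cleaned = filter_outliers([10.0] * 19 + [100.0])
scaled = min_max_normalize([2.0, 4.0, 6.0])
binned = quantile_bin([5.0, 1.0, 3.0, 7.0], n_bins=2)
print(imputed, len(cleaned), scaled, binned)
```

The order in which these steps were applied to the real data is not stated, so each function is shown on its own input.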
Figure: workflow diagram (Raw Dataset, Clean Dataset, Classification Model; Hypothesis)
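The methods name an SVM classifier with an RBF kernel. As a reminder of what that kernel computes, K(x, y) = exp(-gamma * ||x - y||^2), here is a minimal sketch. The `gamma` value is an assumption, since the slides do not report kernel parameters, and this is only the kernel function, not a trained classifier.

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float = 1.0) -> float:
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


same = rbf_kernel([1.0, 2.0], [1.0, 2.0])
apart = rbf_kernel([0.0], [1.0], gamma=1.0)
print(same, apart)
```

Identical feature vectors give a similarity of 1, and the value decays toward 0 as the vectors move apart; a larger `gamma` makes that decay faster.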
Dietary Attributes as Predictors
Figure: p-values of dietary attributes (Ice cream, Rice, Tea, Shellfish, Salad, Tomato, Egg, Milk) in two panels, Polyp vs. Healthy and Colorectal vs. Healthy
P-values shown: 2.38E-02, 4.21E-01, 4.11E-02, 1.21E-01, 2.53E-02, 9.57E-01, 3.71E-02, 5.60E-02
Polyp vs. Healthy: SVM Predictor Accuracy = 64%
Colorectal vs. Healthy: SVM Predictor Accuracy = 65%
Lipidomics T-Test Results
Significant features selected by t-test, with their corresponding p-values
Features Polyp vs. Healthy Polyp vs. Colorectal Colorectal vs. Healthy
16:0/18:1 PE 1.76E-02
24:1 Cer 6.90E-03
LPE 18:1 <1.00E-04
LPE 20:0 1.50E-03 2.00E-04
An-16:0 LPA 3.23E-02
An-18:1 LPA 3.38E-02 1.33E-02
AA 1.13E-02
18:2 LPA 1.13E-02 4.50E-03
20:4 LPA 2.40E-02
22:6 FA 4.28E-02 3.24E-02
LPE 16:0 3.08E-02 3.40E-03
LPE 18:0 3.90E-03 1.00E-04
LPE 18:1 2.18E-02
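The p-values above come from two-sample unpaired t-tests. The sketch below computes the test statistic and degrees of freedom for one feature using Welch's unequal-variance form, which is an assumption because the slides do not say which variant was used. The input values are made up, and converting the statistic to a p-value is left to a statistics library.

```python
import math
import statistics
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, Welch-Satterthwaite degrees of freedom) for two groups."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)  # sample variances
    n_a, n_b = len(a), len(b)
    se2 = var_a / n_a + var_b / n_b  # squared standard error of the mean difference
    t = (mean_a - mean_b) / math.sqrt(se2)
    df = se2 ** 2 / ((var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1))
    return t, df


# Made-up intensities for one lipid feature in two groups.
group_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
group_2 = [2.0, 3.0, 4.0, 5.0, 6.0]

t_stat, dof = welch_t(group_1, group_2)
print(t_stat, dof)
```

A feature would be kept when the p-value for its t statistic falls below 0.05, the significance level named in the methods.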
Integrating Lipidomics with Clinical Features
Performance comparisons
                         Accuracy                 Accuracy                       Accuracy
                         (without pre-selection)  (with t-test pre-selection)    (automatic selection)
Polyp vs. Healthy        0.54                     0.71                           0.78
Colorectal vs. Healthy*  0.57                     0.63                           0.73
Polyp vs. Colorectal*    0.70                     0.90                           0.87
* Since the number of subjects was less than 15, 3-fold cross-validation accuracy was reported.
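A minimal sketch of how subjects can be split for k-fold cross-validation, using 3 folds as in the footnote. The function name, the shuffling seed, and the 14-subject example are assumptions for illustration; the classifier itself is not shown.

```python
import random
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int,
                   seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index splits; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        splits.append((train, test))
    return splits


# Hypothetical small group of 14 subjects, split into 3 folds.
splits = k_fold_indices(14, 3)
for train, test in splits:
    print(len(train), len(test))
```

The reported accuracy would then be the average of the accuracies measured on the three held-out folds.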
                         Accuracy
Polyp vs. Healthy        0.55
Colorectal vs. Healthy*  0.60
Polyp vs. Colorectal*    0.60
Without Clinical Features
With Clinical Features
Messages
Individual Omics data sets have variable predictive performance
Thorough statistical filtering plus biological knowledge integration is needed to combat the inherently high level of data noise
Integration of different Omics data with clinical data can improve predictive performance
Acknowledgment
We thank all the members in our team.