
Integrative Colorectal Cancer Omics Data Mining and Knowledge Discovery

Jake Y. Chen, Ph.D., IUPUI

Indiana Center for Systems Biology & Personalized Medicine

http://bio.informatics.iupui.edu

Polyp and Colorectal Cancer

Polyp vs. Colorectal Cancer
• Polyps are benign tumors of the large intestine.
• They do not invade nearby tissue or spread to other parts of the body.
• If not removed from the large intestine, they may become malignant (cancerous) over time.
• Most cancers of the large intestine are believed to have developed from polyps.

Photo courtesy of National Cancer Institute

Colon Cancer vs. Rectal Cancer
• Share many commonalities, including molecular mechanisms.
• Tend to be treated differently.

Colorectal Cancer Molecular Pathways

A. Walther, et al. (2009) Nature Reviews Cancer, 9(7) pp. 489-99

Omics/Clinical Data Source: Proteomics/Metabolomics/Lipidomics/Clinical Data

Data set                H    PP   CR   N
Diet                    70   54   29   153
Oxidative Stress        50   32   12   94
LC-MS Proteomics        80   72   40   192
Vitamin D               83   81   31   195
GCxGC MS Metabolomics   83   84   30   197
Lipidomics              47   35   15   97
NMR Metabolomics        53   35   15   103

H = healthy, PP = polyp, CR = colorectal cancer, N = total subjects

Scientific Questions to Answer

Data Analysis
• Which Omics data set has the best predictive power?
• Which features in the Omics data are important?

Data Mining
• Does integration of Omics data improve prediction?
• Which combination of Omics data sets has the best predictive power?

Knowledge Discovery
• Why do those features in the Omics data have the best predictive power?

Roadmap

• Knowledge Discovery of Proteomics Data
• Knowledge Discovery of Metabolomics Data
• Integrative Data Mining

Proteomics Data Description

Group: Bindley Biosciences Center at Purdue University

Instruments: Agilent Chip Cube coupled to the XCT Plus ESI ion trap

Data format at CCE webportal: mzXML

Number of Samples: Normal: 80; Polyp: 72; Colorectal: 40

LC-MS Proteomics Data Processing

LC/MS data “heat map”

Total Ion Chromatogram (TIC) summarized from enhanced heat map
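As a rough illustration of how a TIC can be summarized from a scan-by-m/z intensity matrix, the sketch below sums the intensities of each scan. The matrix values and the function name `total_ion_chromatogram` are made up for illustration, and the image-enhancement step applied to the heat map in the slides is not reproduced here.

```python
from typing import List


def total_ion_chromatogram(intensity: List[List[float]]) -> List[float]:
    """Sum intensities over all m/z bins for each scan (one row per scan)."""
    return [sum(scan) for scan in intensity]


# Toy LC/MS "heat map": 3 scans (rows) x 4 m/z bins (columns); values are invented.
heat_map = [
    [0.0, 2.0, 1.0, 0.0],
    [5.0, 1.0, 0.0, 4.0],
    [0.0, 0.0, 3.0, 0.0],
]

tic = total_ion_chromatogram(heat_map)
print(tic)  # [3.0, 10.0, 3.0]
```

Each TIC value corresponds to one scan, so plotting it against retention time gives the chromatogram from which informative retention-time regions can be picked.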

Methods adapted from:
N. Jeffries (2005) Bioinformatics, vol. 21, no. 14, pp. 3066.
S.A. Kazmi, et al. (2006) Metabolomics, vol. 2, no. 2, pp. 75-83.

Image Enhanced LC/MS data “heat map”

LC-MS Major Protein Identification

~25-28 characteristic proteins identified per sample

Identify Most Informative TIC R.T. “Grid”

Apply the R.T. Grid to Original Spectra

Use Mascot to Search for Protein ID at R.T. Grid Regions

No Scan RT Uniprot_ID Score Expect Evidence

1 119 139.48 ADAD2_HUMAN 38 3.3 0

2 229 265.87 NNMT_HUMAN 43 1.1 2

3 372 429.15 ZSA5D_HUMAN 42 1.2 0

4 656 749.8 BRAF_HUMAN 40 2.2 479

5 1162 1276.6 RGS7_HUMAN 47 0.39 1

6 1310 1407.2 TTC9C_HUMAN 35 6.3 0

7 1669 1713.9 CP042_HUMAN 38 3.1 0

8 1866 1879.1 HXD11_HUMAN 34 8.4 0

9 1987 1980.3 ING4_HUMAN 38 3.1 2

10 2114 2086 ZN423_HUMAN 33 10 0

11 2353 2285.7 CL065_HUMAN 37 3.9 0

12 2539 2441.3 CA5BL_HUMAN 47 0.4 1

13 2722 2594.7 NPDC1_HUMAN 38 3.6 0

14 2874 2722.2 DJC27_HUMAN 37 3.8 0

15 3001 2828.5 BORG4_HUMAN 40 2.2 1

16 3165 2965.1 KC1G1_HUMAN 27 43 0

17 3440 3196.1 TPPC5_HUMAN 40 2 0

18 3656 3377.6 UB2D3_HUMAN 43 0.99 1

19 3997 3665.5 TM208_HUMAN 34 8.1 0

20 4257 3885.4 ZBED3_HUMAN 29 23 0

Proteomics Result Interpretation

Proteins Identified from Colon Cancer and Healthy Groups

Uniprot_ID    Frequency in Colon (10)   Frequency in Health (10)   Evidence in PubMed
BRAF_HUMAN    3                         0                          508
DMP46_HUMAN   3                         0                          0
NNMT_HUMAN    3                         1                          4
MRP_HUMAN     1                         3                          0
STK33_HUMAN   0                         3                          0

Proteins Interacted with High-Frequency Proteins from Colon Cancer Group

Uniprot_ID     Gene    Protein Name                                           Evidence in PubMed
BRAF1_HUMAN    BRAF    Serine/threonine-protein kinase B-raf                  508
P53_HUMAN      TP53    Cellular tumor antigen p53                             443
CD44_HUMAN     CD44    CD44 antigen                                           411
MDM2_HUMAN     MDM2    E3 ubiquitin-protein ligase Mdm2                       131
BCR_HUMAN      BCR     Breakpoint cluster region protein                      59
LCK_HUMAN      LCK     Tyrosine-protein kinase Lck                            29
Q7RTZ3_HUMAN   LCK     Tyrosine-protein kinase Lck                            29
CAV1_HUMAN     CAV1    Caveolin-1                                             21
PNPH_HUMAN     PNP     Purine nucleoside phosphorylase                        13
CBL_HUMAN      CBL     E3 ubiquitin-protein ligase CBL                        11
RAF1_HUMAN     RAF1    RAF proto-oncogene serine/threonine-protein kinase     10
CD38_HUMAN     CD38    ADP-ribosyl cyclase 1                                  8
NNMT_HUMAN     NNMT    Nicotinamide N-methyltransferase                       4
IRAK1_HUMAN    IRAK1   Interleukin-1 receptor-associated kinase 1             3
DMPK_HUMAN     DMPK    Myotonin-protein kinase                                2
ITA5_HUMAN     ITGA5   Integrin alpha-5                                       1
ITB1_HUMAN     ITGB1   Integrin beta-1                                        1
ZAP70_HUMAN    ZAP70   Tyrosine-protein kinase ZAP-70                         1

Proteomics Result Interpretation: A Network Biology Context

Protein Network Constructed from the Top 3 Differential Proteins

Green-circled proteins are frequently detected (frequency >= 0.3) in colon cancer patient blood samples by LC/MS.
Node: Protein with evidence from PubMed, found by searching ("GENE_SYMBOL" AND ("colon" OR "colorectal") AND ("cancer" OR "carcinoma")).
Edge: Protein interaction with confidence score from HAPPI 1.31 (4- and 5-star).
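The node and highlighting rules in this caption can be sketched in a few lines. The function names below are hypothetical, the counts are taken from the colon cancer column of the frequency table, and reading "(10)" in that table as 10 samples per group is an assumption. The sketch only builds the query string; it does not contact PubMed.

```python
def pubmed_query(gene_symbol: str) -> str:
    """Fill the caption's query template with one gene symbol."""
    return (f'("{gene_symbol}" AND ("colon" OR "colorectal") '
            'AND ("cancer" OR "carcinoma"))')


def frequently_detected(counts, n_samples, threshold=0.3):
    """Return IDs whose detection frequency (count / n_samples) meets the threshold."""
    return [pid for pid, c in counts.items() if c / n_samples >= threshold]


# Detection counts in the colon cancer group (assumed to be out of 10 samples).
colon_counts = {
    "BRAF_HUMAN": 3,
    "DMP46_HUMAN": 3,
    "NNMT_HUMAN": 3,
    "MRP_HUMAN": 1,
    "STK33_HUMAN": 0,
}

hits = frequently_detected(colon_counts, n_samples=10)
print(hits)
print(pubmed_query("BRAF"))
```

Under these assumptions the three proteins detected in 3 of 10 colon cancer samples pass the 0.3 cutoff, which matches the "Top 3 Differential Proteins" used to seed the network.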

Proteomics Result Interpretation: A Biological Pathway Context

BRAF (Serine/threonine-protein kinase B-raf) plays major roles in Colorectal Cancer Pathway (KEGG data)

NNMT (Nicotinamide N-methyltransferase) is involved in Biological Oxidations/Phase II Conjugation/Methylation (from Reactome)

Proteomics Result Interpretation: A Biological Pathway Context for NNMT

Roadmap

• Knowledge Discovery of Proteomics Data
• Knowledge Discovery of Metabolomics Data
  • NMR Data
  • GCxGC MS Data
• Integrative Data Mining

Metabolomics Data Description

Group: Daniel Raftery Laboratory at Purdue University

1. NMR Data
Instruments: Bruker Avance 500 MHz NMR
Data format at CCE webportal: Excel spreadsheet
Number of Samples: Normal: 53; Polyp: 35; Colorectal: 15

2. GCxGC MS Data
Instruments: LECO Pegasus 4D GCxGC-TOF
Data format at CCE webportal: Excel spreadsheet
Number of Samples: Normal: 83; Polyp: 84; Colorectal: 30

NMR Data Analysis Workflow

Extract peaks’ ppm

Search Against Human Metabolome Database (2.5) to identify metabolites

Report only significant metabolites

Sample_ID   1                            2
Top1        Delta-Hexanolactone          Delta-Hexanolactone
Top2        Hypotaurine                  Hypotaurine
Top3        2,3-Diphosphoglyceric acid   Diethanolamine
Top4        Diethanolamine               3,7-Dimethyluric acid
Top5        3-Phosphoglyceric acid       Methyl isobutyl ketone
Top6        3,7-Dimethyluric acid        1,3,7-Trimethyluric acid
Top7        1,3,7-Trimethyluric acid     Cysteine-S-sulfate
Top8        L-Allothreonine              L-Allothreonine
Top9
Top10

Signal Processing

NMR Peak Metabolite Identification Using the Human Metabolome Database

1) Input the peak lists

2) Get the metabolites; leave out those with fewer than 2 matches
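The two steps above can be sketched as a simple peak-matching routine. This is only an illustration of the "fewer than 2 matches" filter: the reference chemical shifts, the 0.01 ppm tolerance, and the function name `match_metabolites` are assumptions, not values or tools from the Human Metabolome Database.

```python
def match_metabolites(peaks_ppm, reference, tol=0.01, min_matches=2):
    """Count reference peaks that have an observed peak within `tol` ppm and
    keep only metabolites with at least `min_matches` matched peaks."""
    result = {}
    for name, ref_peaks in reference.items():
        n_matched = sum(
            any(abs(p - r) <= tol for p in peaks_ppm) for r in ref_peaks
        )
        if n_matched >= min_matches:
            result[name] = n_matched
    return result


# Hypothetical reference shifts in ppm; NOT real database entries.
reference = {
    "metabolite_A": [1.20, 3.45, 3.80],
    "metabolite_B": [2.10, 7.30],
    "metabolite_C": [5.00],
}

# Peak list extracted from one sample (invented values).
observed = [1.204, 2.50, 3.452, 7.305]

matches = match_metabolites(observed, reference)
print(matches)  # {'metabolite_A': 2}
```

Requiring at least two matched peaks discards metabolites supported by a single coincidental peak, which is the stated reason for the filter in step 2.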

Significant Metabolites Identified from NMR Metabolomics Data

Group                    Metabolites
Polyp vs Health          D-Arabitol, D-Pantethine (2/35 vs 0/53)
Colorectal vs Polyp      None
Colorectal vs Health     D-Arabitol (2/15 vs 0/53)

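Population Frequency = (number of samples in the group in which the metabolite is identified) / (number of samples in the group)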

Marker metabolites? Shared metabolites

D-Arabitol Identified from NMR Results

Involved in Pentose and Glucuronate Interconversions Pathways

Roadmap

• Knowledge Discovery of Proteomics Data
• Knowledge Discovery of Metabolomics Data
  • NMR Data
  • GCxGC MS Data
• Integrative Data Mining

Results from GCxGC MS Data I

Metabolite identification is more straightforward

Metabolites listed in three columns: Polyp vs Healthy, Colorectal vs Polyp, Colorectal vs Healthy

Methanesulfinic acid, trimethylsilyl ester Acetic acid, (methoxyimino)-, trimethylsilyl ester Butanoic acid, 2-[(trimethylsilyl)oxy]-, trimethylsilyl ester

Propanoic acid, 2-(methoxyimino)-, trimethylsilyl ester

Pentanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester

L-Valine, N-(trimethylsilyl)-, trimethylsilyl ester

Hexanedioic acid, bis(2-ethylhexyl) ester Methanesulfinic acid, trimethylsilyl ester Cholesterol trimethylsilyl ether

Mefloquine Pentanedioic acid, 2-(methoxyimino)-, bis(trimethylsilyl) ester

Hexanoic acid, trimethylsilyl ester

Cyclohexane, 1,3,5-trimethyl-2-octadecyl- L-Valine, N-(trimethylsilyl)-, trimethylsilyl ester Pentanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester

Tetradecanoic acid, trimethylsilyl ester Butanoic acid, 2-[(trimethylsilyl)oxy]-, trimethylsilyl ester

Hexanoic acid, 2-(methoxyimino)-, trimethylsilyl ester

psi,psi.-Carotene, 3,3',4,4'-tetradehydro-1,1',2,2'-tetrahydro-1,1'-dimethoxy-2,2'-dioxo-

Cyclohexane, 1,3,5-trimethyl-2-octadecyl- 3,6-Dioxa-2,7-disilaoctane, 2,2,4,7,7-pentamethyl-

Silanol, trimethyl-, pyrophosphate (4:1) Butanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester

Trimethylsilyl ether of glycerol L-Asparagine, N,N2-bis(trimethylsilyl)-, trimethylsilyl ester

Ethylbis(trimethylsilyl)amine

Cyclotrisiloxane, 2,4,6-trimethyl-2,4,6-triphenyl-

Benzene, (1-hexadecylheptadecyl)-

Pentanedioic acid, 2-(methoxyimino)-, bis(trimethylsilyl) ester

Results from GCxGC MS Data II

A. Polyp vs Healthy B. Polyp vs Colorectal C. Colorectal vs Healthy

Comparative Results (Intensity vs. Population)

Marker Metabolite Panel Clustering of Three Groups

Intensity based Heat map

Population Frequency based Heat map

Metabolites Identified from GCxGC MS Results

Involved in Fatty Acid Biosynthesis Pathways

Roadmap

• Knowledge Discovery of Proteomics Data
• Knowledge Discovery of Metabolomics Data
• Integrative Data Mining

Data Set Description

Data sets: Diet, Lipidomics, Oxidative Stress and Vitamin D (VD). The number of features and the total number of subjects vary across data sets.

Three classes are balanced to the least common denominator. Pairwise comparisons:
• Healthy vs. Polyp
• Healthy vs. Colorectal
• Polyp vs. Colorectal
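One plausible reading of the balancing step is that, for each pairwise comparison, the larger class is randomly downsampled to the size of the smaller one. The sketch below illustrates that reading only; the function name, the random seed, and the sample IDs are hypothetical, and the group sizes are the lipidomics counts from the data source overview (H=47, PP=35, CR=15).

```python
import random


def balance_two_classes(group_a, group_b, seed=0):
    """Randomly downsample both groups to the size of the smaller one."""
    n = min(len(group_a), len(group_b))
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return rng.sample(group_a, n), rng.sample(group_b, n)


# Placeholder subject IDs sized like the lipidomics cohort.
healthy = [f"H{i}" for i in range(47)]
polyp = [f"PP{i}" for i in range(35)]
colorectal = [f"CR{i}" for i in range(15)]

h_bal, cr_bal = balance_two_classes(healthy, colorectal)
print(len(h_bal), len(cr_bal))  # 15 15
```

Balancing this way keeps a trivial majority-class predictor at 50% accuracy, which makes the reported accuracies easier to compare across the three comparisons.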

                 Diet   Lipid   Oxidative   VD
Total Subjects   150    97      94          195
Total Features   38     49      3           2

Predictive Modeling Methods

Data Preprocessing
• Filtering outliers (more than three standard deviations away from the mean)
• Data normalization (transforming to the 0-1 range)
• Binning into categorical data using the quantile binning method
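The three preprocessing steps can be sketched for a single numeric feature as follows. This is a minimal illustration, not the pipeline used for the slides: the use of the sample standard deviation, the choice of four bins, and all function names and data values are assumptions.

```python
import statistics


def filter_outliers(values, n_sd=3.0):
    """Drop values more than n_sd sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) <= n_sd * sd]


def min_max_normalize(values):
    """Rescale values linearly to the 0-1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def quantile_bin(values, n_bins=4):
    """Assign each value a bin index 0..n_bins-1 by rank (equal-frequency bins)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


# Invented feature values with one extreme measurement.
raw = [10.0, 11.0, 12.0, 13.0] * 5 + [100.0]
cleaned = filter_outliers(raw)
normalized = min_max_normalize(cleaned)
binned = quantile_bin([5.0, 1.0, 3.0, 2.0, 4.0, 6.0, 8.0, 7.0])
print(len(cleaned), min(normalized), max(normalized), binned)
```

Note that with very small groups a single extreme value inflates the standard deviation enough that it can never exceed the three-SD cutoff, so this filter is only meaningful for reasonably sized samples.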

Missing Value Treatment
• Replaced with the mean value of the attribute in the group
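A minimal sketch of group-wise mean imputation is shown below. It assumes that "group" means the subject's class (for example healthy or polyp) and that missing values are encoded as `None`; both are assumptions, as are the function name and the toy data.

```python
def impute_group_mean(rows, groups):
    """Replace None in each feature column with the mean of the observed
    values of that feature among subjects of the same group."""
    filled = [list(r) for r in rows]
    n_features = len(rows[0])
    for j in range(n_features):
        for g in set(groups):
            observed = [
                r[j] for r, gg in zip(rows, groups)
                if gg == g and r[j] is not None
            ]
            if not observed:
                continue  # nothing to average; leave missing values as they are
            mean = sum(observed) / len(observed)
            for i, gg in enumerate(groups):
                if gg == g and filled[i][j] is None:
                    filled[i][j] = mean
    return filled


# Toy data: two features, two subjects per group, one missing value per group.
rows = [[1.0, None], [3.0, 4.0], [None, 10.0], [5.0, 20.0]]
groups = ["H", "H", "PP", "PP"]

filled = impute_group_mean(rows, groups)
print(filled)
```

Imputing within the group rather than across all subjects avoids pulling a class's values toward the other class's mean, at the cost of using class labels during preprocessing.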

Support Vector Machine (SVM) Classifier
• Kernel: Radial Basis Function (RBF)
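For reference, the RBF kernel measures similarity between two feature vectors as exp(-gamma * ||x - y||^2). The sketch below computes only this kernel value; it is not an SVM implementation, and the `gamma` values are arbitrary since the slides do not state the kernel parameters.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


same = rbf_kernel([0.2, 0.7], [0.2, 0.7])            # identical vectors
apart = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)  # squared distance 2
print(same, apart)
```

Because the kernel depends on distances between feature vectors, scaling every feature to the 0-1 range beforehand keeps any single feature from dominating the similarity.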

Feature Selection Methods
• Approach #1: Two-sample unpaired t-tests at the 5% significance level; features are filtered using their p-values.
• Approach #2: SVM Attribute Evaluator with Ranker algorithm.
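The statistic behind Approach #1 can be sketched with the standard library. The sketch assumes the equal-variance (pooled) form of the two-sample t-test, which the slides do not specify, and it stops at the t statistic and degrees of freedom; turning these into a p-value requires the t distribution, for which a library routine such as `scipy.stats.ttest_ind` could be used.

```python
import math
import statistics


def pooled_t_statistic(a, b):
    """Two-sample unpaired t statistic with pooled variance.

    Returns (t, degrees_of_freedom)."""
    n_a, n_b = len(a), len(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(a) - statistics.mean(b)) / standard_error
    return t, df


# Invented values of one feature in two groups.
group_1 = [1.0, 2.0, 3.0]
group_2 = [2.0, 3.0, 4.0]

t_stat, dof = pooled_t_statistic(group_1, group_2)
print(round(t_stat, 4), dof)  # -1.2247 4
```

A feature would be retained when the p-value derived from this statistic falls below 0.05; repeating the test over many features without correction will admit some false positives.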

K-fold Cross-validation
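The splitting logic of k-fold cross-validation can be sketched as below. The shuffling seed and function name are assumptions, and the example uses k = 3, the setting this deck reports for comparisons with fewer than 15 subjects; the value of k used elsewhere is not stated.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


splits = list(k_fold_indices(12, k=3))
for train, test in splits:
    print(len(train), len(test))  # 8 4 on each of the three lines
```

Every subject appears in exactly one test fold, so the averaged accuracy uses each subject once for evaluation. Any feature selection should be repeated inside each training fold to avoid optimistic estimates.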

[Workflow figure; labels: Raw Dataset, Clean Dataset, Classification Model, Hypothesis]

Dietary Attributes as Predictors

Polyp vs. Healthy: SVM Predictor Accuracy = 64%

Colorectal vs. Healthy: SVM Predictor Accuracy = 65%

[Figure: p-values of dietary attributes in the two comparisons. Attributes shown: Ice cream, Rice, Tea, Shellfish, Salad, Tomato, Egg, Milk. P-values shown: 2.38E-02, 4.21E-01, 4.11E-02, 1.21E-01, 2.53E-02, 9.57E-01, 3.71E-02, 5.60E-02]

Lipidomics T-Test Results

Significant features selected by t-test, with their corresponding p-values

Features Polyp vs. Healthy Polyp vs. Colorectal Colorectal vs. Healthy

16:0/18:1 PE 1.76E-02

24:1 Cer 6.90E-03

LPE 18:1 <1.00E-04

LPE 20:0 1.50E-03 2.00E-04

An-16:0 LPA 3.23E-02

An-18:1 LPA 3.38E-02 1.33E-02

AA 1.13E-02

18:2 LPA 1.13E-02 4.50E-03

20:4 LPA 2.40E-02

22:6 FA 4.28E-02 3.24E-02

LPE 16:0 3.08E-02 3.40E-03

LPE 18:0 3.90E-03 1.00E-04

LPE 18:1 2.18E-02

Integrating Lipidomics with Clinical Features: Performance Comparisons

Comparison                Accuracy (without pre-selection)   Accuracy (with t-test pre-selection)   Accuracy (automatic selection)
Polyp vs. Healthy         0.54                               0.71                                   0.78
Colorectal vs. Healthy*   0.57                               0.63                                   0.73
Polyp vs. Colorectal*     0.70                               0.90                                   0.87

* Since the number of subjects was less than 15, 3-fold cross-validation accuracy was reported.

Comparison                Accuracy
Polyp vs. Healthy         0.55
Colorectal vs. Healthy*   0.60
Polyp vs. Colorectal*     0.60

Without Clinical Features

With Clinical Features

Messages

Individual Omics data sets have variable predictive performance

Thorough statistical filtering and biological knowledge integration are needed to combat the inherently high level of data noise

Integration of different Omics data with clinical data can improve predictive performance


Acknowledgment

We thank all the members of our team.
