DESCRIPTION
tranSMART Community Meeting 5-7 Nov 13 - Session 3: tranSMART: a Data Warehouse for Translational Medicine at Takeda Pharmaceuticals International
Dave Merberg, Takeda
We have used the tranSMART platform to construct a warehouse containing data from several Takeda clinical trials, proprietary preclinical drug activity studies, 1600 Gene Expression Omnibus studies, and data from TCGA, CCLE, and other sources. All gene expression data has been globally normalized. We extended the tranSMART platform with a set of R function calls to enable cross-study queries and analysis via the rich toolset available in R. The utility of the data warehouse is exemplified by a study in which we built a predictive model for drug sensitivities. The model was trained on gene expression and IC50 data from cell lines and was found to correctly predict drug activity in oncology indications.
TRANSCRIPT
tranSMART: a data warehouse for Translational Medicine at Takeda Pharmaceuticals International Co.
David Merberg, Bin Li, William Trepicchio
tranSMART Community Workshop, November 2013
Outline
• Takeda’s tranSMART instance
– Goal
– Data content
– Enhancements
• Case Studies – Models for predicting erlotinib and sorafenib efficacy
Takeda rationale for implementing tranSMART
• To provide a large, well organized, and integrated dataset consisting of MPI/Takeda proprietary data, outsourced data, and valuable public data.
• To provide an integrated environment for accessing clinical data and molecular profiling data
– Low dimensional data – age, sex, weight, previous treatments, survival, etc.
– High dimensional data – gene expression microarray, SNP, mutation, NGS
• To provide tools that will enable Medical and Discovery scientists to use this data warehouse for biomarker identification, patient stratification, drug targeting, disease prediction, etc.
Public data currently in Takeda tranSMART
• Gene Expression Omnibus (GEO)
– Approximately 1600 studies
– Approximately 200 key cancer studies manually curated; another ~150 cancer studies curated via text mining
– Most GEO datasets are cancer studies, but there are also samples from cardiovascular disease, metabolic diseases, hematopoietic diseases, and many others.
• The Cancer Genome Atlas (TCGA)
– Gene expression, SNP, and clinical data from close to 1000 patients (brain, lung, and ovarian cancer)
• Large cell line panels
– The CCLE dataset, ~1000 cell lines, screened for 24 SOC drugs
– The Sanger dataset, ~1000 cell lines, screened on >100 SOC drugs
Proprietary data currently in Takeda tranSMART
• Velcade Trials
– Clinical observations
– Gene expression results
– Mutation data
• Commissioned Studies
– Oncopanel 240 – cell line response to Takeda and SOC compounds
  • Drug response (IC50, EC50, cell cycle blocks, apoptosis induction, etc.)
  • Mutation status
  • Gene expression
– Oncotest – xenograft response to Takeda and SOC compounds
  • Drug response (IC50)
  • Mutation status
  • Gene expression
  • SNP
OncoPanel 240 (Ricerca/Eurofins Panlabs)
• 240 well-defined tumor cell lines representing diverse tumor types
• Drug sensitivity screen results (IC50, EC50)
– for 13 Standard of Care anti-tumor compounds
– for 8 Takeda compounds targeting diverse pathways
• Baseline gene expression
• Mutation data
Normalization of information in the data warehouse
• Gene expression data
– Globally normalized GEO gene expression data using frozen Robust Multiarray Analysis (fRMA)
  • Quantile based normalization
  • Currently, only selected Affymetrix platforms are globally normalized
– Enabled grouping gene expression results from different labs and different studies by disease
• Clinical information
– Curate clinical information to create consistent vocabulary
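The quantile-based step can be illustrated with a minimal base-R sketch. This is plain quantile normalization on a toy matrix, not the frozen procedure itself, which additionally relies on reference estimates precomputed from a large training set. The function name `quantile_normalize` and the toy values are assumptions made for illustration only.

```r
# Plain quantile normalization (illustrative; not the full frozen procedure).
# Rows are probes/genes, columns are arrays/samples.
quantile_normalize <- function(x) {
  # Rank the values within each column; ties are broken by order of appearance.
  ranks <- apply(x, 2, rank, ties.method = "first")
  # Reference distribution: mean of the sorted values at each rank position.
  sorted <- apply(x, 2, sort)
  reference <- rowMeans(sorted)
  # Replace each value by the reference value at its rank.
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Toy log-intensity matrix: 4 genes x 3 arrays (made-up values).
toy <- matrix(c(5, 2, 3, 4,
                4, 1, 4, 2,
                3, 4, 6, 8),
              nrow = 4,
              dimnames = list(paste0("gene", 1:4), paste0("array", 1:3)))
normalized <- quantile_normalize(toy)
print(normalized)
```

After this step every array shares the same set of values, so differences between arrays reflect gene ranks rather than array-wide intensity shifts.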
R interface
• Enable direct access to tranSMART database tables
– Eliminates some limitations of the web interface, e.g. the inability to perform multi-study queries and analyses.
– Provides a connection to the R environment, including diverse analysis packages
• Sample functions
– getDistinctConcepts – given a keyword/string, returns study codes for matching clinical concepts in the tranSMART database
– getGEXData – given study codes, gets gene expression data from the tranSMART database.
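> # Find clinical concepts matching 'Breast_Cancer' and collect their study codes
> br_concepts <- transmart.getDistinctConcepts(, 'Breast_Cancer')
> study_list <- unique(br_concepts$STUDYCODE)
> # Pull ITGB2 expression across all matching studies
> ITGB2_GEP_BR2 <- transmart.getGEXData(study_list, gene.list='ITGB2', data.pivot=FALSE)
> # Histogram of ITGB2 log intensity pooled over studies
> hist(ITGB2_GEP_BR2$LOG_INTENSITY, breaks=50, xlim=c(5,12), main="All ITGB2 GEP", xlab="GEP")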
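Once results from several studies sit in one data frame, ordinary R tools can summarize them per study. The sketch below uses a mock data frame in place of a live query. It assumes a long-format result with `STUDYCODE` and `LOG_INTENSITY` columns (names taken from the example above); the actual structure returned by `transmart.getGEXData` is not shown in the slides, and the study codes and values are simulated.

```r
# Mock stand-in for a long-format query result. The real return structure of
# transmart.getGEXData is an assumption here; all values are simulated.
set.seed(1)
mock_gex <- data.frame(
  STUDYCODE = rep(c("STUDY_A", "STUDY_B", "STUDY_C"), times = c(40, 25, 35)),
  LOG_INTENSITY = c(rnorm(40, mean = 8.0, sd = 0.8),
                    rnorm(25, mean = 8.3, sd = 0.7),
                    rnorm(35, mean = 7.9, sd = 0.9))
)

# Per-study sample count, median, and interquartile range.
per_study <- do.call(rbind, lapply(split(mock_gex, mock_gex$STUDYCODE), function(d) {
  data.frame(STUDYCODE = as.character(d$STUDYCODE[1]),
             n = nrow(d),
             median = median(d$LOG_INTENSITY),
             iqr = IQR(d$LOG_INTENSITY))
}))
print(per_study)
```

A per-study summary like this is one simple way to check that pooled values are comparable before drawing a combined histogram.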
Summary
• A data warehouse with a large store of gene expression, SNP, and phenotypic data
– Clinical samples and cell lines
– Data normalized so that comparisons across studies are meaningful
– Vocabulary standardized across studies
• An R-interface to facilitate cross-study analysis using a large collection of methods from statistics and machine learning
• A “toolbox” for achieving key Translational Medicine goals
– Bridging the gap between “omic” data generated in preclinical studies and clinical results
– Predicting drug efficacy using clinical and pre-clinical information collected for different purposes
• Case studies in using this toolbox follow . . .
Building and using a model to predict drug sensitivity
Can we identify a relationship between baseline gene expression and drug sensitivity in cell lines . . .
. . . and then extrapolate from that relationship to use gene expression to predict drug efficacy in the clinic?
[Figure: MLN7243 IC50 distribution on Ricerca panel; x-axis: Cell lines; y-axis: IC50s]
Building the predictive models
• Normalize all Oncopanel 240 expression data
• Remove low-intensity and low-variance genes (to get robust signal)
• Correlation based feature selection (gene expression vs IC50s)
• Develop a methodology for deriving drug sensitivity models
– Based on Partial Least Squares Regression (PLSR)
– Captures consensus information from cancer cell line panel data
• Use two SOC drugs as proof of concept for methodology
– Predict erlotinib (inhibits EGFR) sensitivity
– Predict sorafenib (inhibits VEGFR and PDGFR) sensitivity
– Use PFS from BATTLE trial to evaluate performance of models
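The filtering, feature selection, and regression steps listed above can be sketched in base R on simulated data. This is only an illustration under stated assumptions: the cutoffs, the number of selected genes, and the single-component PLS fit are arbitrary choices, and the slides do not say which implementation or settings were actually used.

```r
set.seed(42)
n_lines <- 60
n_genes <- 200
# Simulated log2 expression: rows = cell lines, columns = genes.
expr <- matrix(rnorm(n_lines * n_genes, mean = 7, sd = 1), nrow = n_lines,
               dimnames = list(paste0("line", 1:n_lines), paste0("g", 1:n_genes)))
# Simulated log2(IC50), driven by the first five genes plus noise.
log_ic50 <- as.vector(expr[, 1:5] %*% rep(0.8, 5)) + rnorm(n_lines, sd = 0.5)

# Step 1: drop low-intensity and low-variance genes (cutoffs are arbitrary).
keep <- colMeans(expr) > 6 & apply(expr, 2, var) > 0.5
expr_f <- expr[, keep, drop = FALSE]

# Step 2: correlation-based feature selection against log2(IC50).
r <- apply(expr_f, 2, cor, y = log_ic50)
selected <- names(sort(abs(r), decreasing = TRUE))[1:20]
X <- scale(expr_f[, selected], center = TRUE, scale = FALSE)
y <- log_ic50 - mean(log_ic50)

# Step 3: one-component partial least squares regression (PLS1).
w <- crossprod(X, y)                      # weights proportional to cov(X, y)
w <- w / sqrt(sum(w^2))
t_scores <- X %*% w                       # latent component scores
q <- sum(t_scores * y) / sum(t_scores^2)  # regress y on the component
beta <- w * q                             # coefficients on centered expression

predicted <- as.vector(X %*% beta) + mean(log_ic50)
cat("Correlation of fitted vs. observed log2(IC50):",
    round(cor(predicted, log_ic50), 2), "\n")
```

The correlation printed here is computed on the training data and is therefore optimistic. A realistic workflow would use several components chosen by cross-validation (for example with the `plsr` function from the pls package) and would evaluate on held-out samples.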
[Figure: Oncopanel 240 expression data; Oncopanel 240 drug sensitivity; plot of MLN7243 IC50 distribution on Ricerca panel (x-axis: Cell lines; y-axis: IC50s)]
Accuracy of the erlotinib sensitivity model
Re-predicting Oncopanel 240 log2(IC50)
Accuracy estimation: upper boundary 91%, lower boundary 77%
Signature genes in the Erlotinib model reflect known drug mechanism
• Signature genes over-connected to EGFR
• Signature genes over-representing pathways that contain an EGFR node
• Also, EGFR ligand NRG1 is among the signature genes
Real data tests of the models
• Test 1: The BATTLE clinical trial
– 255 lung cancer (NSCLC) patients, 131 with gene expression profile data (GSE33072)
  • 25 patients in erlotinib arm
  • 39 patients in sorafenib arm
– Are the predictions of the PLSR models consistent with the results of the BATTLE trial?
• Test 2: Predicting drug sensitivity across indications
– Use model to predict erlotinib and sorafenib sensitivity based on gene expression data from 484 Gene Expression Omnibus datasets in Takeda tranSMART instance
  • 11,331 samples grouped into 19 major oncology indications
  • Calculate percentage predicted drug sensitive tumors for each indication
  • Compare predictions to results of phase III clinical trials and FDA approvals
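The per-indication percentage in Test 2 amounts to a grouped proportion. A minimal sketch with simulated predictions follows. The indication labels, the predicted values, and the sensitivity cutoff are all hypothetical, since the slides do not state how a sample was called "sensitive".

```r
set.seed(7)
# Mock predictions: one row per tumor sample with an indication label and a
# model-predicted log2(IC50). All values are simulated.
samples <- data.frame(
  indication = sample(c("Lung", "Liver", "Kidney"), 300, replace = TRUE),
  pred_log_ic50 = rnorm(300, mean = 2, sd = 1)
)

# Hypothetical cutoff: samples predicted below it are called "sensitive".
sensitivity_cutoff <- 1
samples$sensitive <- samples$pred_log_ic50 < sensitivity_cutoff

# Percentage of predicted-sensitive tumors within each indication.
pct_sensitive <- aggregate(sensitive ~ indication, data = samples,
                           FUN = function(s) round(100 * mean(s), 2))
names(pct_sensitive)[2] <- "pct_predicted_sensitive"

# Attach the number of samples per indication for context.
counts <- table(samples$indication)
pct_sensitive$n_samples <- as.vector(counts[as.character(pct_sensitive$indication)])
print(pct_sensitive)
```

The resulting table has the same shape as a drug-by-indication summary: one row per indication with a sample count and a percentage that can be compared against trial outcomes.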
Test 1 – The BATTLE Trial: Survival analysis of groups predicted to be drug sensitive/resistant by PLSR model
[Figure: four survival plots, panels (A)-(D); x-axis: Months from Start of Therapy; y-axis: Proportion of Cases]
Panel titles: E_model pred E_PFS; S_model pred S_PFS; E_model pred S_PFS; S_model pred E_PFS
Reported statistics (order as extracted): P = 0.09, HR = 0.43; P = 0.006, HR = 0.32; P = 0.54, HR = 1.32; P = 0.32, HR = 1.87
E: Erlotinib; S: Sorafenib; red: predicted sensitive; green: predicted resistant
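A comparison of this kind, progression-free survival in predicted-sensitive versus predicted-resistant patients, can be sketched with the survival package that ships with R. The cohort below is simulated, and the choice of a log-rank test and a Cox model is an assumption; the slides report P values and hazard ratios without naming the methods behind them.

```r
library(survival)

set.seed(11)
n <- 40
# Simulated cohort: model call per patient and progression-free survival.
group <- factor(rep(c("resistant", "sensitive"), each = n / 2),
                levels = c("resistant", "sensitive"))
# Predicted-sensitive patients are simulated with a lower hazard.
pfs_months <- rexp(n, rate = ifelse(group == "sensitive", 0.15, 0.40))
progressed <- rbinom(n, 1, 0.85)   # 1 = progression observed, 0 = censored
cohort <- data.frame(group, pfs_months, progressed)

# Kaplan-Meier estimate per predicted group.
km <- survfit(Surv(pfs_months, progressed) ~ group, data = cohort)
print(km)

# Log-rank test for a difference between the two curves.
lr <- survdiff(Surv(pfs_months, progressed) ~ group, data = cohort)
p_value <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)

# Cox model: hazard ratio of predicted-sensitive vs. predicted-resistant.
cox <- coxph(Surv(pfs_months, progressed) ~ group, data = cohort)
hazard_ratio <- unname(exp(coef(cox)))

cat("log-rank P =", signif(p_value, 2), " HR =", round(hazard_ratio, 2), "\n")
```

With "resistant" as the reference level, a hazard ratio below 1 means the predicted-sensitive group progresses more slowly. Calling `plot(km, col = c("green", "red"))` would draw the two curves in the color scheme of the legend above.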
Test 2: Are predictions of erlotinib sensitivity, grouped by indication, consistent with clinical results?
Kidney cancer is predicted to be erlotinib insensitive; a phase III clinical trial failed
Lung cancer is predicted to be erlotinib sensitive; a phase III clinical trial succeeded (companion diagnostic available)
Potential new indication? Multiple head and neck cancer trials are going on now
Test 2: Are predictions of sorafenib sensitivity, grouped by indication, consistent with clinical results?
Kidney and Liver cancers are predicted to be Sorafenib sensitive
Sorafenib has been approved for Kidney and Liver cancers
Potential new indication?
Conclusions
• Using tranSMART, we created a large data warehouse to provide computational support for biomarker identification, patient stratification, and other Translational Medicine goals.
• Patient and cell line data can be grouped across studies by indication or other attributes to increase statistical power. Grouping is enabled by:
– Global normalization of numeric data
– Standardization of vocabulary
– An R interface that provides direct access to database tables
• Using erlotinib and sorafenib as case studies, we demonstrated that the data warehouse and the R interface enable us to predict patient stratification and drug efficacy in cancer indications.
Acknowledgements
Takeda: Andy Dorner, Gene Shin, Andrew Krueger, Seema Grover, Jike Cui (now at Sanofi)
Recombinant by Deloitte: Jinlei Liu, Mike McDuffie, Hiaping Xia
Thomson Reuters: Elona Kolpakova-Hart
Backup Slides
Model test 2: How well do the models predict the drug-indication efficacy profile?
Cancer Type | Successful Phase III trial - FDA approval | Number of samples | % tumors predicted Erlotinib sensitive | % tumors predicted Sorafenib sensitive
Lung Cancer | Erlotinib | 329 | 15.81 | 0.61
Liver Cancer | Sorafenib | 85 | 0.00 | 31.76
Kidney Cancer | Sorafenib | 218 | 0.46 * | 24.77
* Erlotinib failed to show efficacy for kidney cancer in a phase III trial