Cake Talk
Subphenotyping in TCGA data
James McMurray
PhD Student
Department of Empirical Inference
22/10/2014
Outline of talk
- Background
  - TCGA Project
  - Subphenotyping
  - General idea
- Example study
  - Overview
  - Data
- Replication
  - Overview
  - Analysis of larger dataset
- Future Work
  - Deep models
- Conclusion
TCGA Project
- The Cancer Genome Atlas
- Public multi-omics data:
  - SNPs (restricted)
  - Gene Expression arrays
  - RNASeq
  - Copy Number Variation
  - DNA Methylation
  - miRNASeq
  - Proteomics
- Many different types of cancer including GBM (brain), BRCA (breast cancer), KIRC/KIRP (kidney cancer), etc.
- Aim to find links between various types of cancer
- Improve understanding of molecular basis
TCGA Overview
What is subphenotyping?
- Identify sub-types of broad phenotypes and group patients by them
- Clustering of patients (population structure)
- Sub-disease classification
- Helps to provide intuition about molecular basis
- Diagnostic biomarkers
- Provide specific candidate drug targets
- Improve precision of medicine
- Unsupervised Learning
General idea
1. Cluster tumour samples based on some biomarkers (e.g. variations in gene expression)
2. Find the most significant differences between clusters (e.g. in gene expression) and check whether the clusters correspond to clinical differences (e.g. in survival time)
3. Carry out a Gene Ontology Enrichment analysis (i.e. test whether certain functional classes of genes are over-expressed or under-expressed in the clusters)
4. If so, investigate possible causal pathways and identify drug targets (i.e. genes which might have an effect if knocked out in the tumour)
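Step 3 can be illustrated with a small sketch. The talk does not say which enrichment test is used, so this assumes one common choice: a hypergeometric over-representation test on a list of selected genes. The function name and all numbers are made up for illustration.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_annotated, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement
    from n_background genes, of which n_annotated carry a given GO term."""
    upper = min(n_annotated, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Illustrative numbers only (not results from the talk):
# 1000 background genes, 50 annotated with the term,
# 100 genes selected as cluster-specific, 15 of them annotated.
p_value = hypergeom_enrichment_p(1000, 50, 100, 15)
```

A small p-value suggests the GO term is over-represented among the selected genes. In a real analysis one such test is run per GO term, so the p-values need a multiple-testing correction.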
GO Example
Example study
- Verhaak, R.G., et al. (2010) Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell. 17(1):98-110
- Previously identified four sub-types of GBM (Glioblastoma Multiforme) using factor analysis and consensus clustering:
  - Proneural
  - Neural
  - Classical
  - Mesenchymal
- Most significant genes were PDGFRA, IDH1, EGFR, and NF1.
- Glioblastoma multiforme (GBM) is the most common form of malignant brain cancer in adults
- Affected patients have a poor prognosis with a median survival of one year
Gene Expression differences
- Gene expression differences:
Clinical differences
- Clinical differences:
Gene Ontology Enrichment
- Gene Ontology (GO) Enrichment:
Data
- Patients with GBM cancer
- 202 samples with three gene expression measurements of 11,861 genes.
- Note we could also include RNASeq, which is another measure of gene expression
  - Neglected due to the size of the data and the available samples
- Note that not all expression arrays measure the same genes, so there is some missing data
- If we wanted to use more samples, we would also need to deal with missing gene expression measurements across samples
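One simple way to handle arrays that measure different genes is to keep only the genes common to all of them. The sketch below shows that option and also reports how much is missing over the union of genes. The platform names and gene lists are toy placeholders, and the talk does not say that this is the strategy actually used.

```python
def common_genes(platforms):
    """Genes measured by every platform (set intersection)."""
    gene_sets = [set(genes) for genes in platforms.values()]
    return sorted(set.intersection(*gene_sets))


def missing_fraction(platforms):
    """Fraction of (platform, gene) cells that are missing over the gene union."""
    all_genes = set().union(*platforms.values())
    measured = sum(len(set(genes)) for genes in platforms.values())
    return 1 - measured / (len(all_genes) * len(platforms))


# Hypothetical platform names and toy gene lists
platforms = {
    "array_a": ["EGFR", "IDH1", "NF1", "PDGFRA"],
    "array_b": ["EGFR", "IDH1", "NF1"],
    "array_c": ["EGFR", "NF1", "PDGFRA", "TP53"],
}
shared = common_genes(platforms)    # genes usable without imputation
gap = missing_fraction(platforms)   # share of cells that would need imputing
```

Intersecting is the conservative choice: it discards genes rather than imputing values. Keeping the union instead requires a model that tolerates or fills in missing entries.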
Data
- Lots of missing data:
Overview
- Wanted to replicate the study using other dimensionality reduction and clustering methods to test robustness.
- Used other TCGA GBM samples, and the data of the aforementioned paper.
  - Other samples: 473 samples of 17,430 genes
  - Verhaak, et al.: 202 samples of 11,861 genes.
- Used GPLVM for dimensionality reduction, then k-means for clustering.
Clustering with GPLVM: 2D
- Larger dataset clustered with k-means on 2D latent space
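The clustering step can be sketched as plain k-means (Lloyd's algorithm) on 2D points. This assumes the latent coordinates have already been produced by the GPLVM, which is not reproduced here. The toy points and the explicit starting centroids are illustrative assumptions.

```python
import math


def kmeans(points, init_centroids, n_iter=100):
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean update."""
    centroids = [tuple(c) for c in init_centroids]
    k = len(centroids)
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


# Toy 2D "latent" points forming two well-separated groups
latent = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels, centroids = kmeans(latent, init_centroids=[latent[0], latent[-1]])
```

k-means only finds a local optimum, so the result depends on the starting centroids. In practice several restarts are run and the solution with the lowest within-cluster distance is kept.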
Clustering with GPLVM: 3D
Most significantly different genes
- The expression of the genes (rows) across the GBM samples (columns). The magenta lines delineate the clusters.
- Note different genes to Verhaak, et al.
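The talk does not state how the most significantly different genes were chosen. As one possible approach, the sketch below ranks genes by a one-way ANOVA F statistic across clusters. The gene names, values and function names are hypothetical.

```python
def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of values."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (between / df_between) / (within / df_within)


def rank_genes(expression, labels):
    """Order genes from most to least different between clusters.

    expression : dict mapping gene -> list of values, one per sample
    labels     : cluster label per sample, in the same sample order
    """
    clusters = sorted(set(labels))
    scores = {}
    for gene, values in expression.items():
        groups = [[v for v, lab in zip(values, labels) if lab == c] for c in clusters]
        scores[gene] = f_statistic(groups)
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: gene_a separates the two clusters, gene_b does not
labels = [0, 0, 0, 1, 1, 1]
expression = {
    "gene_a": [1.0, 1.2, 0.8, 5.0, 5.2, 4.8],
    "gene_b": [2.0, 3.0, 4.0, 3.1, 2.1, 3.9],
}
ranked = rank_genes(expression, labels)
```

The sketch assumes every gene varies within at least one cluster, otherwise the within-cluster variance is zero and the statistic is undefined. Because the clusters were derived from the same expression data, such scores are useful for ranking but not as unbiased significance tests.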
Clinical differences
Cluster   Total dead        Mean survival time of the dead (days)
Red       91/120 (75.8%)    40.2
Green     65/82 (79.3%)     41.8
Yellow    139/206 (67.5%)   37.7
Blue      47/65 (72.3%)     528.4
- The mean survival time of those who died demonstrates clinical differences between the clusters
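As a quick consistency check, the percentages in the table can be recomputed from the counts. The numbers are copied from the table above and the variable names are illustrative.

```python
# (deceased, total) per cluster, copied from the table above
counts = {
    "Red": (91, 120),
    "Green": (65, 82),
    "Yellow": (139, 206),
    "Blue": (47, 65),
}

# Percentage of patients in each cluster who died, to one decimal place
percent_dead = {
    cluster: round(100 * dead / total, 1)
    for cluster, (dead, total) in counts.items()
}

# The cluster sizes should add up to the size of the larger dataset
total_samples = sum(total for _, total in counts.values())
```

The cluster sizes sum to 473, matching the larger dataset. Note that a mean taken only over patients who died ignores those still alive at last follow-up, so it understates survival in clusters with many censored patients.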
Clinical differences: Survival curves
- Also observe difference in survival curves:
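The talk does not say how the survival curves were estimated. A standard choice is the Kaplan-Meier estimator, sketched here under that assumption with toy follow-up data.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate as [(time, survival_probability)] at each death time.

    times  : follow-up time per patient (days)
    events : 1 if death was observed at that time, 0 if the patient was censored
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for tt, e in zip(times, events) if tt == t and e)
        leaving = sum(1 for tt in times if tt == t)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients both leave the risk set
    return curve


# Toy data: five patients, two of them censored
times = [100, 200, 200, 300, 400]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
```

Computing one curve per cluster and plotting them as step functions gives the usual comparison. Unlike a mean over deceased patients only, this uses the censored patients for as long as they were followed.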
Clinical differences: Survival curves
- Looking only at those who died:
Checking for artefacts: Tissue Source Site
- Clusters do not seem to correspond solely to Tissue Source Site (source lab of sample)
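A check of this kind can be sketched as a cross-tabulation of cluster against Tissue Source Site. The site code is the second hyphen-separated field of a TCGA barcode. The barcodes and labels below are made up for illustration.

```python
from collections import Counter


def tissue_source_site(barcode):
    """Return the TSS code, the second hyphen-separated field of a TCGA barcode."""
    return barcode.split("-")[1]


def site_shares(barcodes, labels):
    """For each (cluster, site) pair, the fraction of the cluster from that site."""
    table = Counter(
        (label, tissue_source_site(b)) for b, label in zip(barcodes, labels)
    )
    sizes = Counter(labels)
    return {(c, s): n / sizes[c] for (c, s), n in table.items()}


# Made-up barcodes and cluster labels
barcodes = ["TCGA-02-0001", "TCGA-02-0002", "TCGA-06-0003", "TCGA-06-0004"]
labels = ["Red", "Blue", "Red", "Blue"]
shares = site_shares(barcodes, labels)
```

If one site accounted for nearly all of a cluster, the clustering could reflect a batch effect rather than biology. Shares spread across sites, as in this toy case, are what one would hope to see.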
Future Work
- Still need to carry out Gene Ontology analysis and analyse clinical data more thoroughly (e.g. producing survival graphs)
- Compare results thoroughly with the results of Verhaak, et al.
- Repeat analysis on their dataset (mostly finished but omitted here due to time constraints)
- Possible application of Deep Learning?
Deep Probabilistic Models
- Hierarchical GPLVM example with stick figure motion:
Deep Probabilistic Models
- TCGA data also has hierarchy:
Conclusion
- Sub-phenotyping of cancer is important for discovering clinically distinct sub-populations and possible drug targets for treatments.
- Started analysis of GBM cancer data because it can be compared with previously published work by Verhaak, et al.
- Main contributions of Machine Learning:
  - Feature selection
  - Dimensionality reduction
  - Clustering
  - Handling missing data
  - Principled data fusion
- Any suggestions for these tasks would be appreciated
Thanks for your time
Questions?