Cake Talk

Subphenotyping in TCGA data

James McMurray

PhD Student

Department of Empirical Inference

22/10/2014

Outline of talk

Background
  TCGA Project
  Subphenotyping
  General idea

Example study
  Overview
  Data

Replication
  Overview
  Analysis of larger dataset

Future Work
  Deep models

Conclusion

TCGA Project

I The Cancer Genome Atlas

I Public multi-omics data:
  I SNPs (restricted)
  I Gene Expression arrays
  I RNASeq
  I Copy Number Variation
  I DNA Methylation
  I miRNASeq
  I Proteomics

I Many different types of cancer including GBM (brain), BRCA (breast cancer), KIRC/KIRP (kidney cancer), etc.

I Aim to find links between various types of cancer

I Improve understanding of molecular basis

TCGA Overview

What is subphenotyping?

I Identify sub-types of broad phenotypes and group patients by them

I Clustering of patients - population structure

I Sub-disease classification

I Helps to provide intuition about molecular basis

I Diagnostic biomarkers

I Provide specific candidate drug targets

I Improve precision of medicine

I Unsupervised Learning

General idea

1. Cluster tumour samples based on some biomarkers (e.g. variations in gene expression)

2. Find the most significant differences between clusters (e.g. in gene expression) and check whether the clusters correspond to clinical differences (e.g. in survival time)

3. Carry out a Gene Ontology Enrichment analysis (i.e. test whether certain functional classes of genes are over-expressed or under-expressed in the clusters)

4. If so, investigate possible causal pathways and identify drug targets (i.e. genes which might have an effect if knocked out in the tumour)
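Step 2 can be sketched as follows. This is a minimal illustration, assuming cluster labels are already available from step 1 and that genes are ranked by a one-way ANOVA F statistic; the slides do not say which test was actually used, and the gene names and values are made up.

```python
from statistics import mean


def f_statistic(values, labels):
    """One-way ANOVA F statistic for one gene across the clusters."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = mean(values)
    # Variation of the cluster means around the overall mean.
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Variation of the samples around their own cluster mean.
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return (between / (k - 1)) / (within / (n - k))


def rank_genes(expression, labels):
    """Return gene names sorted from most to least different between clusters.

    expression maps a gene name to its expression values, one per sample,
    in the same sample order as labels.
    """
    scores = {gene: f_statistic(vals, labels) for gene, vals in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: hypothetical gene names, six samples in two clusters.
labels = [0, 0, 0, 1, 1, 1]
expression = {
    "GENE_A": [1.0, 1.2, 0.9, 5.0, 5.1, 4.8],  # differs between clusters
    "GENE_B": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # similar in both clusters
}
ranking = rank_genes(expression, labels)
```

With thousands of genes, the scores would also need a multiple-testing correction before any gene is called significant.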

GO Example
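The over-representation test behind a GO enrichment analysis can be sketched as a hypergeometric tail probability. This is one common formulation, not necessarily the one used in the study; the function name and the counts in the example are illustrative.

```python
from math import comb


def enrichment_p_value(n_background, n_annotated, n_selected, n_overlap):
    """Probability of seeing at least n_overlap annotated genes by chance.

    n_background: number of genes measured in total
    n_annotated:  number of background genes carrying the GO term
    n_selected:   number of genes in the cluster's gene list
    n_overlap:    number of selected genes carrying the GO term
    """
    total = comb(n_background, n_selected)
    upper = min(n_annotated, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy example: 20 genes measured, 5 carry the term, 5 selected, all 5 overlap.
p = enrichment_p_value(20, 5, 5, 5)
```

A small p-value means the term appears in the selected gene list more often than random selection would explain. In practice many GO terms are tested, so the p-values need correcting for multiple testing.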

Example study

I Verhaak, R.G., et al. (2010) Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell. 17(1):98-110

I Previously identified four sub-types of GBM (Glioblastoma Multiforme) using factor analysis and consensus clustering:
  I Proneural
  I Neural
  I Classical
  I Mesenchymal

I Most significant genes were PDGFRA, IDH1, EGFR, and NF1.

I Glioblastoma multiforme (GBM) is the most common form of malignant brain cancer in adults

I Affected patients have a poor prognosis with a median survival of one year

Gene Expression differences

I Gene expression differences:

Clinical differences

I Clinical differences:

Gene Ontology Enrichment

I Gene Ontology (GO) Enrichment:

Data

I Patients with GBM cancer

I 202 samples with three gene expression measurements of 11,861 genes.

I Note we could also include RNASeq, which is another measure of Gene Expression

I Neglected due to the size of the data and the available samples

I Note that not all expression arrays measure the same genes, so there is some missing data

I If we wanted to use more samples we would need to deal with missing gene expression measurements across samples too
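Two simple ways of dealing with missing measurements are sketched below: keep only genes measured in every sample, or fill gaps with the per-gene mean. Neither is claimed to be the approach used in this work; the gene names and values are made up, and `None` marks a missing measurement.

```python
def complete_genes(expression):
    """Keep only genes that were measured in every sample."""
    return {
        gene: values
        for gene, values in expression.items()
        if all(v is not None for v in values)
    }


def impute_gene_means(expression):
    """Replace each missing value with the mean of that gene's observed values."""
    imputed = {}
    for gene, values in expression.items():
        observed = [v for v in values if v is not None]
        if not observed:
            continue  # gene never measured: nothing to impute from, so drop it
        fill = sum(observed) / len(observed)
        imputed[gene] = [fill if v is None else v for v in values]
    return imputed


# Toy data: hypothetical gene names, three samples.
expression = {
    "GENE_A": [1.0, None, 3.0],
    "GENE_B": [2.0, 2.5, 3.0],
    "GENE_C": [None, None, None],
}
complete = complete_genes(expression)
imputed = impute_gene_means(expression)
```

Dropping incomplete genes discards information, while mean imputation shrinks the variance of the affected genes. A probabilistic model that treats missing entries as unobserved is a more principled alternative.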

Data

I Lots of missing data:

Overview

I Wanted to replicate the study using other dimensionality reduction and clustering methods to test robustness.

I Used other TCGA GBM samples, and the data of the aforementioned paper.

I Other samples: 473 samples of 17,430 genes

I Verhaak, et al.: 202 samples of 11,861 genes.

I Used a Gaussian Process Latent Variable Model (GPLVM) for dimensionality reduction, then k-means for clustering.
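The clustering stage can be sketched as plain k-means on latent coordinates. This assumes the low-dimensional coordinates have already been produced by the GPLVM, which is not shown here. The deterministic farthest-first initialisation is a choice made for this illustration, and the points are made up.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100):
    """Lloyd's k-means with farthest-first initialisation.

    points: list of coordinate tuples (e.g. 2D latent positions of samples).
    Returns (labels, centroids).
    """
    # Initialise: first point, then repeatedly the point farthest from
    # all centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Toy 2D latent space with two well-separated groups of samples.
latent = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(latent, k=2)
```

k-means depends on its initialisation and on the choice of k, so in a robustness study it is worth repeating the clustering with different starting points and cluster counts.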

Clustering with GPLVM: 2D

I Larger dataset clustered with k-means on the 2D latent space

Clustering with GPLVM: 3D

Most significantly different genes

I The expression of the genes (rows) across the GBM samples (columns). The magenta lines delineate the clusters.

I Note: these are different genes from those reported by Verhaak, et al.

Clinical differences

Cluster   Total dead        Mean survival time of the dead (days)

Red       91/120 (75.8%)    40.2
Green     65/82 (79.3%)     41.8
Yellow    139/206 (67.5%)   37.7
Blue      47/65 (72.3%)     528.4

I The mean survival time of those who died demonstrates clinical differences between the clusters

Clinical differences: Survival curves

I Also observe difference in survival curves:

Clinical differences: Survival curves

I Looking only at those who died:
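Survival curves of this kind are commonly estimated with the Kaplan-Meier method. The slides do not state which estimator was used, so the following is only a sketch of that standard approach, with made-up follow-up times.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times:  follow-up time in days for each patient
    events: True if the patient died at that time, False if censored
            (still alive at last follow-up)
    Returns a list of (time, survival probability) at each death time.
    """
    curve = []
    survival = 1.0
    death_times = sorted({t for t, died in zip(times, events) if died})
    for t in death_times:
        # Patients still under observation just before time t.
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, died in zip(times, events) if ti == t and died)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Toy cluster of four patients; the third is censored at day 10.
times = [5, 10, 10, 20]
events = [True, True, False, True]
curve = kaplan_meier(times, events)
```

Computing one curve per cluster and comparing them (for example with a log-rank test) uses the censored patients as well. Restricting to patients who died discards that information.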

Checking for artefacts: Tissue Source Site

I Clusters do not seem to correspond solely to Tissue Source Site (source lab of sample)
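One way to quantify such a check is to measure the association between cluster labels and Tissue Source Site with Cramér's V, computed from a chi-square statistic on the contingency table. This is a sketch of one possible check, not the procedure used for the slide; the site codes are hypothetical.

```python
from collections import Counter
from math import sqrt


def cramers_v(clusters, sites):
    """Cramér's V between two categorical labelings (0 = none, 1 = perfect)."""
    n = len(clusters)
    joint = Counter(zip(clusters, sites))
    row_totals = Counter(clusters)
    col_totals = Counter(sites)
    chi2 = 0.0
    for r, row_n in row_totals.items():
        for c, col_n in col_totals.items():
            # Count expected in this cell if cluster and site were independent.
            expected = row_n * col_n / n
            chi2 += (joint[(r, c)] - expected) ** 2 / expected
    dof = min(len(row_totals), len(col_totals)) - 1
    return sqrt(chi2 / (n * dof)) if dof > 0 else 0.0


# Clusters that coincide exactly with the source site (an artefact).
confounded = cramers_v([0, 0, 1, 1], ["SITE_A", "SITE_A", "SITE_B", "SITE_B"])
# Clusters spread evenly over the source sites.
independent = cramers_v([0, 0, 1, 1], ["SITE_A", "SITE_B", "SITE_A", "SITE_B"])
```

A value near 1 would indicate that the clusters mostly reproduce the source site, i.e. a batch effect rather than biology.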

Future Work

I Still need to carry out Gene Ontology analysis and analyse clinical data more thoroughly (e.g. producing survival graphs)

I Compare results thoroughly with the results of Verhaak, et al.

I Repeat analysis on their dataset (mostly finished but omitted here due to time constraints)

I Possible application of Deep Learning?

Deep Probabilistic Models

I Hierarchical GPLVM example with stick figure motion:

Deep Probabilistic Models

I TCGA data also has hierarchy:

Conclusion

I Sub-phenotyping of cancer is important for discovering clinically distinct sub-populations, and possible drug targets for treatments.

I Started analysis of GBM cancer data due to possible comparison with previously published work by Verhaak, et al.

I Main contributions of Machine Learning:
  I Feature selection
  I Dimensionality reduction
  I Clustering
  I Handling missing data
  I Principled data fusion

I Any suggestions for these tasks would be appreciated

Thanks for your time

Questions?
