subphenotyping in tcga data

Cake Talk Subphenotyping in TCGA data James McMurray PhD Student Department of Empirical Inference 22/10/2014

Upload: james-mcmurray

Post on 17-Jul-2015


TRANSCRIPT

Page 1: Subphenotyping in TCGA data

Cake Talk

Subphenotyping in TCGA data

James McMurray

PhD Student
Department of Empirical Inference

22/10/2014

Page 2: Subphenotyping in TCGA data

Outline of talk

Background
  TCGA Project
  Subphenotyping
  General idea

Example study
  Overview
  Data

Replication
  Overview
  Analysis of larger dataset

Future Work
  Deep models

Conclusion

Page 3: Subphenotyping in TCGA data

TCGA Project

- The Cancer Genome Atlas
- Public multi-omics data:
  - SNPs (restricted)
  - Gene Expression arrays
  - RNASeq
  - Copy Number Variation
  - DNA Methylation
  - miRNASeq
  - Proteomics

- Many different types of cancer, including GBM (brain), BRCA (breast cancer), KIRC/KIRP (kidney cancer), etc.

- Aim to find links between various types of cancer

- Improve understanding of molecular basis

Page 4: Subphenotyping in TCGA data

TCGA Overview

Page 5: Subphenotyping in TCGA data

What is subphenotyping?

- Identify sub-types of broad phenotypes and group patients by these

- Clustering of patients - population structure

- Sub-disease classification

- Helps to provide intuition about molecular basis

- Diagnostic biomarkers

- Provide specific candidate drug targets

- Improve precision of medicine

- Unsupervised Learning

Page 6: Subphenotyping in TCGA data

General idea

1. Cluster tumour samples based on some biomarkers (e.g. variations in gene expression)

2. Find the most significant differences between clusters (e.g. in gene expression) and check whether the clusters correspond to clinical differences (e.g. in survival time)

3. Carry out a Gene Ontology Enrichment analysis (i.e. find if certain functional classes of genes are over-expressed or under-expressed in the clusters)

4. If so, investigate possible causal pathways and identify drug targets (i.e. genes which might have an effect if knocked out in the tumour)
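Step 2 can be sketched as ranking genes by how strongly their expression differs between clusters. The sketch below is only an illustration and assumes a one-way ANOVA F statistic as the score; the talk does not say which test was used, and the gene names and values are made up.

```python
from statistics import mean

def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of numeric values."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    group_means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def rank_genes(expression, labels):
    """Sort genes from most to least different between clusters.

    expression: dict mapping gene name -> list of values, one per sample
    labels:     list of cluster labels, one per sample (same order)
    """
    clusters = sorted(set(labels))
    scores = {}
    for gene, values in expression.items():
        groups = [[v for v, lab in zip(values, labels) if lab == c]
                  for c in clusters]
        scores[gene] = f_statistic(groups)
    return sorted(scores, key=scores.get, reverse=True)

# Toy data (hypothetical): GENE_A separates the two clusters, GENE_B does not.
labels = [0, 0, 0, 1, 1, 1]
expression = {
    "GENE_A": [1.0, 1.2, 0.9, 5.0, 5.1, 4.8],
    "GENE_B": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
}
ranking = rank_genes(expression, labels)
```

With thousands of genes, the resulting scores would need a multiple-testing correction before any gene is called significant.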

Page 7: Subphenotyping in TCGA data

GO Example
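A common way to test whether a GO term is over-represented in a gene list is a one-sided hypergeometric test. The sketch below assumes that test (the talk does not name one); the term size, list size and overlap are invented, and only the 11,861-gene background comes from the dataset described later in the talk.

```python
from math import comb

def enrichment_p_value(population, annotated, selected, overlap):
    """P(X >= overlap) when drawing `selected` genes without replacement
    from `population` genes, of which `annotated` carry the GO term."""
    upper = min(annotated, selected)
    total = comb(population, selected)
    tail = sum(comb(annotated, k) * comb(population - annotated, selected - k)
               for k in range(overlap, upper + 1))
    return tail / total

# Hypothetical numbers: 100 of 11,861 genes carry the term, and 5 of the
# 50 most cluster-specific genes are among them.
p = enrichment_p_value(population=11861, annotated=100, selected=50, overlap=5)
```

A small p-value suggests the term appears in the list more often than chance would predict. In practice many terms are tested at once, so the p-values need correcting for multiple testing.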

Page 9: Subphenotyping in TCGA data

Example study

- Verhaak, R.G., et al. (2010) Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell. 17(1):98-110

- Previously identified four sub-types of GBM (Glioblastoma Multiforme) using factor analysis and consensus clustering:
  - Proneural
  - Neural
  - Classical
  - Mesenchymal

- Most significant genes were PDGFRA, IDH1, EGFR, and NF1.

- Glioblastoma multiforme (GBM) is the most common form of malignant brain cancer in adults

- Affected patients have a poor prognosis with a median survival of one year

Page 10: Subphenotyping in TCGA data

Gene Expression differences

- Gene expression differences:

Page 11: Subphenotyping in TCGA data

Clinical differences

- Clinical differences:

Page 12: Subphenotyping in TCGA data

Gene Ontology Enrichment

- Gene Ontology (GO) Enrichment:

Page 13: Subphenotyping in TCGA data

Data

- Patients with GBM cancer

- 202 samples with three gene expression measurements of 11,861 genes.

- Note we could also include RNASeq, which is another measure of Gene Expression
  - Neglected due to the size of the data and the available samples

- Note that not all expression arrays measure the same genes, so there is some missing data

- If we wanted to use more samples, we would need to deal with missing gene expression measurements across samples too
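The talk does not say how the missing measurements were handled. As a rough illustration, two simple baselines are sketched below: keeping only genes measured on every array platform, and filling gaps for a gene with the mean of its observed values. The platform and gene names are hypothetical.

```python
from statistics import mean

def common_genes(platforms):
    """Genes measured on every platform.

    platforms: dict mapping platform name -> dict of gene -> list of values
    """
    gene_sets = [set(genes) for genes in platforms.values()]
    return sorted(set.intersection(*gene_sets))

def impute_gene_mean(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("gene has no observed values to impute from")
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical platforms that cover different gene sets.
platforms = {
    "array_1": {"EGFR": [2.1, 3.0], "NF1": [0.4, 0.6], "IDH1": [1.0, 1.1]},
    "array_2": {"EGFR": [2.0, 3.2], "NF1": [0.5, 0.7]},
}
shared = common_genes(platforms)              # genes usable on both arrays
filled = impute_gene_mean([1.0, None, 3.0])   # gap replaced by the gene mean
```

Restricting to shared genes discards information, while mean imputation understates variance; model-based approaches avoid both problems at the cost of more machinery.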

Page 14: Subphenotyping in TCGA data

Data

- Lots of missing data:

Page 16: Subphenotyping in TCGA data

Overview

- Wanted to replicate the study using other dimensionality reduction and clustering methods to test robustness.

- Used other TCGA GBM samples, and the data of the aforementioned paper.

- Other samples: 473 samples of 17,430 genes

- Verhaak, et al.: 202 samples of 11,861 genes.

- Used GPLVM for dimensionality reduction, then k-means for clustering.
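A full GPLVM is too long to show here. The sketch below is not the GPLVM used in the talk: it uses plain PCA as a stand-in, which corresponds to the special case of a GPLVM with a linear kernel, to show what mapping samples to a low-dimensional latent space looks like. The function names and toy data are assumptions for illustration.

```python
import random
from math import sqrt

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pca_project(rows, n_components=2, iterations=200, seed=0):
    """Project samples onto their leading principal directions.

    Uses power iteration with deflation, so no linear algebra library
    is needed. Returns (latent coordinates, directions).
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    rng = random.Random(seed)
    directions = []
    for _component in range(n_components):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _step in range(iterations):
            # Remove any part of v lying along directions already found.
            for u in directions:
                overlap = dot(v, u)
                v = [a - overlap * b for a, b in zip(v, u)]
            # Multiply by X^T X without forming the d x d matrix.
            scores = [dot(row, v) for row in centred]
            w = [sum(scores[i] * centred[i][j] for i in range(n))
                 for j in range(d)]
            norm = sqrt(dot(w, w))
            if norm == 0:  # no variance left in the remaining directions
                break
            v = [a / norm for a in w]
        directions.append(v)
    latent = [[dot(row, u) for u in directions] for row in centred]
    return latent, directions

# Toy data: most variance along the first axis, less along the second.
samples = [(-2.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
latent, directions = pca_project(samples, n_components=2)
```

The practical difference is that a GPLVM with a non-linear kernel can capture curved structure in the data that a linear projection like this cannot.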

Page 17: Subphenotyping in TCGA data

Clustering with GPLVM: 2D

- Larger dataset clustered with k-means on 2D latent space
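As a minimal sketch of this clustering step, the code below runs Lloyd's k-means algorithm on 2D points. It assumes the latent coordinates have already been computed; the coordinates shown are invented, and the random initialisation and iteration cap are illustrative choices, not settings from the talk.

```python
import random
from math import dist

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm on a list of coordinate tuples.

    Returns (labels, centroids), where labels[i] is the cluster of points[i].
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct samples
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable, so converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return labels, centroids

# Hypothetical 2D latent coordinates forming two well-separated groups.
latent = [(0.0, 0.0), (0.3, 0.1), (0.1, 0.4),
          (10.0, 0.0), (10.2, 0.3), (9.9, 0.2)]
labels, centroids = kmeans(latent, k=2)
```

Setting k=4 would match the four clusters reported in the talk. Because k-means depends on its starting centroids, it is usually run from several seeds and the solution with the lowest within-cluster distance is kept.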

Page 18: Subphenotyping in TCGA data

Clustering with GPLVM: 3D

Page 19: Subphenotyping in TCGA data

Most significantly different genes

- The expression of the genes (rows) across the GBM samples (columns). The magenta lines delineate the clusters.

- Note different genes to Verhaak, et al.

Page 20: Subphenotyping in TCGA data

Clinical differences

Cluster   Total dead         Mean survival time of the dead (days)
Red       91/120 (75.8%)     40.2
Green     65/82 (79.3%)      41.8
Yellow    139/206 (67.5%)    37.7
Blue      47/65 (72.3%)      528.4

- The mean survival time of those who died demonstrates clinical differences between the clusters

Page 21: Subphenotyping in TCGA data

Clinical differences: Survival curves

- Also observe difference in survival curves:
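Survival curves of this kind are usually Kaplan-Meier estimates, which account for patients who were still alive at last follow-up (censored), something a mean over the deceased alone does not. The sketch below assumes that estimator; the talk does not state how the curves were produced, and the follow-up times are invented.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times:  follow-up time per patient (e.g. days)
    events: True if death was observed at that time, False if censored
    Returns a list of (time, survival probability) at each death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients leave the risk set
    return curve

# Hypothetical follow-up data for one cluster.
times = [5, 8, 8, 12, 20]
events = [True, True, False, True, False]
curve = kaplan_meier(times, events)
```

Computing one curve per cluster and plotting them together gives the comparison described on this slide; a log-rank test is the usual way to check whether the curves differ more than chance would explain.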

Page 22: Subphenotyping in TCGA data

Clinical differences: Survival curves

- Looking only at those who died:

Page 23: Subphenotyping in TCGA data

Checking for artefacts: Tissue Source Site

- Clusters do not seem to correspond solely to Tissue Source Site (source lab of sample)
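One way to put a number on this check is to cross-tabulate cluster against tissue source site and measure the association. The sketch below assumes Cramér's V, based on the chi-squared statistic, as that measure; the talk appears to rely on visual inspection, and the site labels are hypothetical.

```python
from collections import Counter
from math import sqrt

def cramers_v(cluster_labels, site_labels):
    """Cramér's V between two categorical labellings.

    0 means no association, 1 means one labelling determines the other.
    """
    n = len(cluster_labels)
    pair_counts = Counter(zip(cluster_labels, site_labels))
    cluster_counts = Counter(cluster_labels)
    site_counts = Counter(site_labels)
    chi2 = 0.0
    for c, c_count in cluster_counts.items():
        for s, s_count in site_counts.items():
            expected = c_count * s_count / n
            observed = pair_counts[(c, s)]
            chi2 += (observed - expected) ** 2 / expected
    min_dim = min(len(cluster_counts), len(site_counts)) - 1
    if min_dim == 0:  # only one cluster or one site: nothing to compare
        return 0.0
    return sqrt(chi2 / (n * min_dim))

# Hypothetical labels: clusters that simply mirror the source site...
confounded = cramers_v([0, 0, 1, 1], ["site_1", "site_1", "site_2", "site_2"])
# ...versus clusters spread evenly across sites.
independent = cramers_v([0, 0, 1, 1], ["site_1", "site_2", "site_1", "site_2"])
```

A value near 1 would suggest the clusters are a batch artefact of where the samples were processed, while a value near 0 supports the reading on this slide.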

Page 25: Subphenotyping in TCGA data

Future Work

- Still need to carry out Gene Ontology analysis and analyse clinical data more thoroughly (e.g. producing survival graphs)

- Compare results thoroughly with the results of Verhaak, et al.

- Repeat analysis on their dataset (mostly finished but omitted here due to time constraints)

- Possible application of Deep Learning?

Page 26: Subphenotyping in TCGA data

Deep Probabilistic Models

- Hierarchical GPLVM example with stick figure motion:

Page 27: Subphenotyping in TCGA data

Deep Probabilistic Models

- TCGA data also has hierarchy:

Page 29: Subphenotyping in TCGA data

Conclusion

- Sub-phenotyping of cancer is important for discovering clinically distinct sub-populations, and possible drug targets for treatments.

- Started analysis of GBM cancer data due to possible comparison with previously published work by Verhaak, et al.

- Main contributions of Machine Learning:
  - Feature selection
  - Dimensionality reduction
  - Clustering
  - Handling missing data
  - Principled data fusion

- Any suggestions for these tasks would be appreciated

Thanks for your time

Questions?