Mining Public Data for Insights into Human Disease

11/16/2009

Baliga Lab Meeting

Chris Plaisier

Utility of Gene Expression for Human Disease

Microarray Technology

Big Picture

Data Access

Gene Expression Microarray Repositories

• Gene Expression Omnibus (GEO)
  - Hosted by: NCBI
  - Platform: All accepted
  - Normalization: Experiment-by-experiment basis
  - Access: R (GEOquery), EUtils
  - Meta-Information: GEOmetadb

• ArrayExpress
  - Hosted by: EMBL
  - Platform: All accepted
  - Normalization: Experiment-by-experiment basis
  - Access: Web interface, EMBL API
  - Meta-Information: ? (API)

• Many smaller repositories have more phenotypic information for specific diseases
  - Phenotypic information may be hard to access
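As a rough illustration of EUtils access to GEO, the sketch below only builds an ESearch URL for the GEO DataSets database (`db=gds`); it does not send the request. The helper name `geo_esearch_url`, the search term, and the `retmax` value are assumptions chosen for illustration (GPL570 is the GEO accession for the HGU133 Plus 2.0 platform).

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def geo_esearch_url(term, retmax=20):
    """Build an ESearch URL against the GEO DataSets database (db=gds).

    `term` is an Entrez query string; `retmax` caps the number of IDs returned.
    This helper is hypothetical and only assembles the URL.
    """
    query = urlencode({"db": "gds", "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


# Example query: glioma records on the HGU133 Plus 2.0 platform (GPL570).
url = geo_esearch_url("glioma AND GPL570[ACCN]")
print(url)
```

The resulting URL can be fetched with any HTTP client; the response lists record IDs that can then be passed to other E-utilities calls.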

Gene Expression Omnibus

Samples Per Platform in GEO

HGU133 Plus 2.0

HGU133A

Latest 3’ Affymetrix Array

Affymetrix arrays account for ~67% of human gene expression data in public repositories.

Affymetrix Probesets

Probe

Probe Pair

Probeset (11 Probe Pairs)

Perfect Match

Mismatch

GeneChip U133 Plus 2.0 Array (image stored as CEL file)

>54,000 Probesets

25 nucleotides

Pre-Processing 101

Pre-Processing Gene Expression Data

Removing Mis-Targeted and Non-Specific Probes

CEL File

CDF File

Intensities

Normally CDF File Comes from Affymetrix

Zhang, et al. 2005

CEL File

Alt CDF File

Intensities

Alternative CDF File Thoroughly Cleaned

What Makes Cells Different?

PANP: Presence/Absence Filtering

• Use Negative Strand Matching Probesets (NSMPs) to determine true background distribution

  - NSMP probesets are designed to hybridize to the strand opposite the expressed strand

• Utilize this background distribution from these NSMPs to threshold the entire dataset

• Output is a call for each array for each gene

Calls are:
• P = Presence
• M = Marginal
• A = Absence
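A minimal sketch of the thresholding idea, assuming an empirical p-value taken directly from the NSMP intensities. This is a simplification and not the PANP implementation; the function names and the cutoffs of 0.01 (P) and 0.02 (M) are assumptions for illustration.

```python
from bisect import bisect_left


def empirical_p(value, sorted_background):
    """Fraction of background (NSMP) intensities that are >= value."""
    n = len(sorted_background)
    idx = bisect_left(sorted_background, value)  # first index with background >= value
    return (n - idx) / n


def call_presence(intensities, nsmp_intensities, tight=0.01, loose=0.02):
    """Return a P/M/A call for each intensity on one array.

    tight and loose are assumed p-value cutoffs: p < tight gives "P",
    tight <= p < loose gives "M", anything else gives "A".
    """
    background = sorted(nsmp_intensities)
    calls = []
    for x in intensities:
        p = empirical_p(x, background)
        if p < tight:
            calls.append("P")
        elif p < loose:
            calls.append("M")
        else:
            calls.append("A")
    return calls


# Toy background of 200 NSMP intensities (1..200) and three probesets.
toy_calls = call_presence([250, 198, 100], range(1, 201))
print(toy_calls)
```

Running this per array yields one call per gene per array, which matches the shape of the output described above.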

Identifying Present Genes

• Filter out genes that are called absent in ≥ 50% of arrays
  - Across the whole dataset
  - Within subsets

• Only present genes are utilized in future analyses
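The filter can be sketched as below, assuming calls are stored per gene as a list of P/M/A strings. Treating marginal (M) calls as not absent is an assumption, as is the helper name `filter_present`; to filter within a subset, pass only the calls for the arrays in that subset.

```python
def filter_present(calls_by_gene, max_absent_fraction=0.5):
    """Keep genes whose fraction of "A" calls is below max_absent_fraction.

    calls_by_gene maps a gene ID to a list of "P"/"M"/"A" calls, one per array.
    Genes absent in >= 50% of arrays are dropped with the default setting.
    """
    kept = []
    for gene, calls in calls_by_gene.items():
        absent_fraction = sum(1 for c in calls if c == "A") / len(calls)
        if absent_fraction < max_absent_fraction:
            kept.append(gene)
    return kept


toy = {
    "g1": ["P", "P", "A", "A"],  # 50% absent: filtered out
    "g2": ["P", "P", "P", "A"],  # 25% absent: kept
    "g3": ["A", "A", "A", "M"],  # 75% absent: filtered out
}
present_genes = filter_present(toy)
print(present_genes)
```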

Removing Redundancy

Reason for Removing Redundancy Before Running

Removing Redundancy

• Collapse Affymetrix Probeset IDs to EntrezIDs

• Test for correlation between probesets
  - If correlation is ≥ 0.8 then combine probesets
  - If not then leave them separate
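One way to sketch the collapse for the probesets of a single Entrez ID is shown below. Using Pearson correlation, grouping greedily against the first member of each group, and combining by averaging are all assumptions; the slides only state the 0.8 threshold.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (fails on constant input)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def collapse_probesets(profiles, threshold=0.8):
    """Group probesets of one gene and average each group.

    profiles maps probeset ID -> expression values across samples.
    A probeset joins the first group whose first member it correlates with
    at >= threshold; otherwise it starts a new group and stays separate.
    """
    groups = []  # list of (member IDs, member profiles)
    for pid, prof in profiles.items():
        for ids, members in groups:
            if pearson(prof, members[0]) >= threshold:
                ids.append(pid)
                members.append(prof)
                break
        else:
            groups.append(([pid], [prof]))
    # Average each group sample by sample.
    return [(ids, [sum(col) / len(col) for col in zip(*members)])
            for ids, members in groups]


toy_profiles = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8.5],  # tracks "a" closely
    "c": [4, 1, 3, 2],    # does not track "a"
}
collapsed = collapse_probesets(toy_profiles)
print(collapsed)
```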

Pre-Processing Pipeline

= Implemented in R

= Implemented in Python

Big Picture

Glioma: A Deadly Brain Cancer

Wikimedia commons

Brain Anatomy

Wikimedia commons

What do they do?

Neurophysiology

Hierarchy of Nervous Tissue Tumors

Glioma

WHO Grade   Tumor Type                         Percentage of CNS Tumors
I           Pilocytic Astrocytoma
II          Diffuse or Low-Grade Astrocytoma   9.8% (grades I-III combined)
III         Anaplastic Astrocytoma
IV          Glioblastoma Multiforme            20.3%

Gliomas account for 40% of all tumors and 78% of malignant tumors.

Buckner et al., 2007

Glioma Survival

http://www.neurooncology.ucla.edu/

5 years

10 years

Repository of Molecular Brain Neoplasia Data (REMBRANDT)

• REMBRANDT (Madhavan et al., 2009)
  - Currently 257 individual specimens

• Glioblastoma multiforme (GBM) = 110
• Astrocytoma = 50
• Oligodendroglioma = 55
• Mixed = 21
• Non-Tumor = 21

Phenotypes
• Tumor type: GBM, Astrocytoma, etc.
• WHO Grade: 176 individuals
• Age: 253 individuals
• Sex: 250 individuals (partially inferred using Y chromosome genes)
• Survival (days post diagnosis): 169 individuals

REMBRANDT: Chromosome Y Expression

[Figure: sex-specific gene expression, with samples grouped as Female and Male]

Apparent male-to-female conversions should be more common than female-to-male ones, because samples from females are unlikely to express Y chromosome genes.

4 females cluster with males

8 males cluster with females

REMBRANDT: Chr. Y Expression, Intelligent Reassignment

[Figure: sex-specific gene expression, with samples grouped as Female and Male]

Intelligent Reassignment: if the previously recorded sex disagrees with the expression-based group, the call is set to NA. All unknowns are given a call.
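The reassignment rule can be written as a small function. The labels "M"/"F", the use of `None` for both unknown and NA, and the function name are assumptions for illustration; the expression-based call is taken as given from the chromosome Y clustering.

```python
def reassign_sex(recorded, expression_call):
    """Apply the reassignment rule to one sample.

    recorded: "M", "F", or None when sex is unknown in the meta-information.
    expression_call: "M" or "F" from clustering on chromosome Y genes.
    Returns the final call, or None (NA) when the two sources conflict.
    """
    if recorded is None:
        return expression_call  # unknowns are given the expression-based call
    if recorded != expression_call:
        return None             # conflicting calls are turned into NA
    return recorded             # agreement keeps the recorded sex


print(reassign_sex(None, "F"), reassign_sex("M", "F"), reassign_sex("M", "M"))
```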

Progression of Astrocytic Glioma

Furnari, et al. (2007)

Modeling Glioma

• Increasing metastatic potential and severity of glioma could be modeled using this simple schema

• Correlation of model to survival post diagnosis is -0.68

[Figure: progression schema with levels 0, 1, and 2]

Exploring Meta-Information

• Age explains 31% of the variance in survival post diagnosis

• Age explains 25% of the variance in the progression model

• Sex does not have a significant effect on either survival or the progression model
  - Yet it is known that glioblastoma is slightly more common in men than in women

Summary

• A large dataset with a good amount of meta-information

• Ready for dimensionality reduction and network inference!

Big Picture

Clustering as Dimensionality Reduction

Big Picture

Likely Issues

• Size of eukaryotic genomes

• Added complexity of regulatory regions

• Tissue and cell type heterogeneity

• Patient genetic and environmental heterogeneity

Relative Genome Sizes

Solutions

• Pre-process genomic sequences

• Reduce data complexity by collapsing redundancies

• Utilize filters that select for only the most variant genes
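A variance filter of the kind mentioned in the last bullet might look like the sketch below. Ranking by sample variance and keeping a fixed number of genes are assumptions; the slides do not say which variability measure or cutoff was used.

```python
from statistics import variance


def top_variant_genes(expr, n_keep):
    """Return the n_keep gene IDs with the highest sample variance.

    expr maps gene ID -> expression values across samples.
    """
    ranked = sorted(expr, key=lambda gene: variance(expr[gene]), reverse=True)
    return ranked[:n_keep]


toy_expr = {
    "a": [1, 1, 1, 1.1],  # nearly flat
    "b": [1, 5, 1, 5],    # highly variable
    "c": [2, 3, 2, 3],    # moderately variable
}
most_variant = top_variant_genes(toy_expr, 2)
print(most_variant)
```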

Likely Issues





Eukaryotic Gene Structure


Transcriptional Start Site

Start Codon

Untranslated Regions


Exons


Introns

Regulatory Regions

3’ UTR

miRNA binding sites (4-9bp motifs)

Promoter

Transcription Factor Binding Sites (6-12bp motifs)

No set length for promoters in eukaryotes.

We capture 2Kbp upstream, so any promoter length up to 2Kbp can be used.

Median 3’ UTR length is 831bp

Three Examples After Capture

85% (n = 36,177) of probesets are associated with a sequence

Solution

• Do motif detection on both promoter and 3’ UTR sequences

• Incorporate both of these regulatory regions into the cMonkey bi-cluster scoring matrix

Promoter Sequences

• Looking for transcription factor binding sites (TFBS)
  - Using MEME with 6-12bp motif widths

• Utilized RefSeq gene mapping to identify putative promoter regions
  - 2Kbp of sequence upstream of the transcriptional start site (TSS) was grabbed

• If two RefSeq gene mappings did not overlap then the longest transcript's promoter was taken
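Grabbing sequence upstream of the TSS has to respect strand. The sketch below assumes 0-based coordinates, an in-memory chromosome string, and a `tss` index pointing at the first transcribed base; these conventions and the function names are illustrative assumptions, not the pipeline's actual code.

```python
# Translation table for complementing DNA, including N and lowercase bases.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def promoter_sequence(chrom_seq, tss, strand, length=2000):
    """Return up to `length` bases upstream of the TSS, read 5' to 3'.

    On the "+" strand, upstream lies at lower coordinates. On the "-" strand
    it lies at higher coordinates and must be reverse-complemented.
    The window is truncated at chromosome ends.
    """
    if strand == "+":
        return chrom_seq[max(0, tss - length):tss]
    if strand == "-":
        end = min(len(chrom_seq), tss + 1 + length)
        return reverse_complement(chrom_seq[tss + 1:end])
    raise ValueError("strand must be '+' or '-'")


toy_chrom = "ACGTTGCAAGGC"
plus_promoter = promoter_sequence(toy_chrom, tss=6, strand="+", length=4)
minus_promoter = promoter_sequence(toy_chrom, tss=5, strand="-", length=4)
print(plus_promoter, minus_promoter)
```

Because the returned sequence always ends at the TSS, a shorter promoter can be taken from a 2Kbp capture by slicing off its last N bases.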

3’ UTR Sequences

• Looking for miRNA binding sites
  - miRNAs are roughly 21-nucleotide RNA molecules that bind to mRNA and alter expression
  - Using MEME with 4-9bp motif widths
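The two motif searches differ mainly in their width ranges. The sketch below only assembles MEME command lines without running them; the file names, output directories, the `zoops` model, the number of motifs, and the choice to scan both strands only for promoters are assumptions, while the width ranges come from the slides.

```python
def meme_command(fasta_path, out_dir, min_width, max_width, both_strands, n_motifs=2):
    """Assemble a MEME command line as an argument list (not executed here)."""
    cmd = [
        "meme", fasta_path,
        "-dna",                    # DNA alphabet
        "-oc", out_dir,            # output directory, overwritten if present
        "-mod", "zoops",           # zero or one motif occurrence per sequence
        "-nmotifs", str(n_motifs),
        "-minw", str(min_width),
        "-maxw", str(max_width),
    ]
    if both_strands:
        cmd.append("-revcomp")     # also search the reverse complement strand
    return cmd


# Hypothetical input files for one cluster's sequences.
promoter_cmd = meme_command("cluster_promoters.fa", "meme_promoter", 6, 12, both_strands=True)
utr_cmd = meme_command("cluster_3utr.fa", "meme_3utr", 4, 9, both_strands=False)
print(" ".join(promoter_cmd))
print(" ".join(utr_cmd))
```

Each list can be passed to `subprocess.run` on a machine where MEME is installed.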

Likely Issues





Complexity of Mammalian Systems

Cellular Heterogeneity in Tissues

What Makes Cells Different?

Solution

• Filter out genes that are not expressed for each tissue, leaving only those that are expressed

• Enhance the capability of the software to handle missing data

Likely Issues





Intelligent Sample Collection

• Genetic and environmental heterogeneity are real world issues

• Can try to match for certain confounders

• Or stratify analyses based on particular confounders

Running cMonkey

• Running cMonkey on the AEGIR cluster
  - 10 nodes with 8 cores per node
  - 1 node has 24GB RAM
  - 2 others have 16GB RAM

• Completion time depends heavily on the size of the run

Beautiful New Result Interface

Looking at a Cluster

Chris’s Graphics Mods

Original cMonkey Output

Sorted cMonkey Output

Boxplot For All Samples

Boxplot for In Samples

Integrating Phenotypes

What to do when you find a cluster?

Checking Out PSSM #1

Known Motif?

Motif Known?

What do the genes do?

Functional Enrichment?

Functional Enrichment

Genes?

Interesting Cluster

Phenotype Correlations

• Survival: correlation coefficient = -0.48, P-value = 3.2 × 10^-11

• Progression Model: correlation coefficient = 0.55, P-value = 6.7 × 10^-16

• Age: correlation coefficient = 0.32, P-value = 2.2 × 10^-7

• Sex: correlation coefficient = -0.27, P-value = 0.0012

Bonferroni-corrected significance threshold: P-value ≤ 0.05 / (585 × 4) ≈ 2.1 × 10^-5
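The threshold arithmetic can be checked directly. Reading 585 as the number of clusters tested and 4 as the number of phenotypes is an assumption based on the formula above; the p-values are the ones reported for this cluster.

```python
n_clusters = 585    # assumed: number of clusters tested
n_phenotypes = 4    # survival, progression model, age, sex
alpha = 0.05

# Bonferroni correction: divide alpha by the total number of tests.
threshold = alpha / (n_clusters * n_phenotypes)

# P-values reported for the cluster on this slide.
p_values = {
    "Survival": 3.2e-11,
    "Progression model": 6.7e-16,
    "Age": 2.2e-7,
    "Sex": 0.0012,
}
significant = {name: p <= threshold for name, p in p_values.items()}

print(f"threshold = {threshold:.2e}")
for name, passed in significant.items():
    print(name, "passes" if passed else "does not pass")
```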

Genes from Cluster

AFFY_ID Gene Symbol Gene Name

212067_S_AT C1R complement component 1, r subcomponent

208747_S_AT C1S complement component 1, s subcomponent

201743_AT CD14 cd14 antigen

215049_X_AT CD163 cd163 antigen

203854_AT CFI complement factor i

213060_S_AT CHI3L2 chitinase 3-like 2

208146_S_AT CPVL carboxypeptidase, vitellogenic-like

201798_S_AT FER1L3 fer-1-like 3, myoferlin (c. elegans)

206584_AT LY96 lymphocyte antigen 96

202180_S_AT MVP major vault protein

204150_AT STAB1 stabilin 1

204924_AT TLR2 toll-like receptor 2

= Previously known to be differentially expressed in GBM.

Motif Matches

PSSM #2

PSSM #1

Summary

• Very promising results

• Need to further develop certain aspects of cMonkey to better utilize the human data

• Then need to build network inference component

General Questions

• Biclustering or not?

• How many genes to run?

• How much sequence to feed MEME?

• Can more than one experiment be included?

Cluster Samples, or Not?

• Bi-clustering clusters not only on genes but also by experimental conditions (samples)

• Because we are using just one experiment it may not be necessary to cluster samples

• Although it may be useful again once other experiments are included

Bi-clustering or Not?

Bi-clustering Gene Clustering Only

Brief Glance

• Looks like for this dataset it may make more sense to only cluster genes
  - More clusters with significant motifs

• Although this is likely to change once we add more experiments to the mix

• Need a method to quantify this

General Questions





Maxing Out cMonkey

• Can cMonkey handle running all genes?
  - Yes, without doing motif finding
  - With motif finding this will take a long time (weeks?) and tends to crash out

• Essentially need to balance sequence length for motif finding with cluster size and number of clusters


General Questions





Length for Promoters?

• MEME suggests 1Kbp or less for sequences as input

• Tried using 500bp, 1Kbp, 2Kbp, 2.5Kbp, and 5Kbp

Brief Glance

• So far it looks like the 500bp promoters give the most clusters with motifs


General Questions





Breast Cancer Metastasis

Bos et al., 2009

cMonkey for Eukaryotes

Future Modifications to cMonkey for eukaryotes:

Preprocess sequence data

Add 3’ UTR miRNA motif detection

Integrate 3’ UTR miRNA motif scores with promoter motif scores

Network Inference

• cMonkey software is utilized to produce the bi-clusters

• Inferelator can then be used to identify regulatory factors

• Simple correlation with phenotypes can relate bi-clusters to disease
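Relating a bi-cluster to a phenotype can be sketched as a correlation between a per-sample cluster summary and the phenotype values. Summarizing the cluster by its median expression, using Pearson correlation, and skipping samples with a missing phenotype (`None`) are assumptions for illustration.

```python
from math import sqrt
from statistics import median


def cluster_profile(expr_rows):
    """Median expression per sample across the genes in a cluster.

    expr_rows is a list of per-gene expression lists with samples in the same order.
    """
    return [median(col) for col in zip(*expr_rows)]


def pearson_complete(x, y):
    """Pearson correlation using only samples where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / sqrt(sxx * syy)


# Toy cluster of three genes over four samples; one sample lacks survival data.
toy_cluster = [[1, 2, 3, 4], [2, 3, 4, 5], [0, 1, 2, 3]]
toy_survival = [10, None, 6, 4]
profile = cluster_profile(toy_cluster)
r = pearson_complete(profile, toy_survival)
print(profile, r)
```

Skipping missing values matters here because survival is available for only 169 of the 257 specimens.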

Acknowledgements

Baliga Lab
• Nitin
• David
• Chris
• Dan

Hood Lab
• Burak Kutlu

• Luxembourg Project
• REMBRANDT
