Mining Public Data for Insights into Human Disease
11/16/2009
Baliga Lab Meeting
Chris Plaisier
Utility of Gene Expression for Human Disease
Microarray Technology
Big Picture
Data Access
Gene Expression Microarray Repositories
• Gene Expression Omnibus (GEO)
  - Hosted by: NCBI
  - Platform: All accepted
  - Normalization: Experiment by experiment basis
  - Access: R (GEOquery), EUtils
  - Meta-Information: GEOmetadb
• ArrayExpress
  - Hosted by: EMBL
  - Platform: All accepted
  - Normalization: Experiment by experiment basis
  - Access: Web interface, EMBL API
  - Meta-Information: ? (API)
• Many smaller repositories which have more phenotypic information for specific diseases
  - Phenotypic information may be hard to access
Gene Expression Omnibus
Samples Per Platform in GEO
HGU133 Plus 2.0
HGU133A
Latest 3’ Affymetrix Array
Affymetrix arrays account for ~67% of human gene expression data in public repositories.
Affymetrix Probesets
• Probe: 25 nucleotides
• Probe pair: Perfect Match probe plus Mismatch probe
• Probeset: 11 probe pairs
• GeneChip U133 Plus 2.0 Array: >54,000 probesets (image stored as CEL file)
Pre-Processing 101
Pre-Processing Gene Expression Data
Removing Mis-Targeted and Non-Specific Probes
• Standard: CEL file + CDF file → intensities
  - Normally the CDF file comes from Affymetrix
• Alternative: CEL file + alternative CDF file → intensities
  - Alternative CDF file is thoroughly cleaned
Zhang et al., 2005
Pre-Processing Gene Expression Data
What Makes Cells Different?
PANP: Presence/Absence Filtering
• Use Negative Strand Matching Probesets (NSMPs) to determine the true background distribution
  - NSMPs are designed to hybridize to the strand opposite the expressed strand
• Use the background distribution from these NSMPs to threshold the entire dataset
• Output is a call for each gene on each array
  - Calls are: P = present, M = marginal, A = absent
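The thresholding idea above can be sketched in a few lines. This is a simplified illustration, not the PANP implementation: it assigns each probeset an empirical p-value against the NSMP intensities on the same array and turns that into a call. The function name `presence_calls` and the cutoffs of 0.01 (present) and 0.02 (marginal) are assumptions chosen for the example.

```python
import numpy as np

def presence_calls(intensities, nsmp_intensities, tight=0.01, loose=0.02):
    """Return 'P', 'M' or 'A' per probeset for one array.

    Simplified sketch: the cutoffs are illustrative assumptions.
    """
    background = np.sort(np.asarray(nsmp_intensities, dtype=float))
    values = np.asarray(intensities, dtype=float)
    # Empirical p-value: fraction of background (NSMP) intensities
    # that are at or above each observed intensity.
    n_at_or_above = background.size - np.searchsorted(background, values, side="left")
    p_values = n_at_or_above / background.size
    calls = np.full(values.shape, "A", dtype="<U1")
    calls[p_values < loose] = "M"   # marginal
    calls[p_values < tight] = "P"   # present (overrides marginal)
    return calls

# Toy example: 100 background intensities, four probesets to call.
toy_background = np.arange(1, 101)
toy_calls = presence_calls([50, 99.5, 101, 100], toy_background)
```

Intensities well inside the background distribution are called absent; only those above nearly all NSMP intensities are called present.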
Identifying Present Genes
• Filter out genes that are absent in ≥ 50% of arrays
  - Whole dataset
  - Subsets
• Only present genes are utilized in future analyses
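A minimal sketch of the presence filter, assuming the calls are held in a genes × arrays matrix of 'P', 'M' and 'A'. The function name `present_gene_mask` is hypothetical, and counting marginal calls as not absent is an assumption. To filter on a subset of samples, pass only those columns.

```python
import numpy as np

def present_gene_mask(calls, max_absent_fraction=0.5):
    """True for genes to keep: absent in fewer than 50% of arrays."""
    calls = np.asarray(calls)
    # Fraction of arrays with an 'A' call, computed per gene (row).
    absent_fraction = (calls == "A").mean(axis=1)
    return absent_fraction < max_absent_fraction

toy_call_matrix = [
    ["P", "P", "A", "A"],  # 50% absent  -> filtered out
    ["P", "P", "P", "A"],  # 25% absent  -> kept
    ["A", "A", "A", "M"],  # 75% absent  -> filtered out
]
toy_mask = present_gene_mask(toy_call_matrix)
```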
Pre-Processing Gene Expression Data
Removing Redundancy
Reason for Removing Redundancy Before Running
Removing Redundancy
• Collapse Affymetrix Probeset IDs to EntrezIDs
• Test for correlation between probesets
  - If correlation is ≥ 0.8 then combine probesets
  - If not then leave them separate
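The collapsing rule can be sketched as follows. The slides do not say how correlated probesets are combined or how more than two probesets per gene are handled, so this example makes two assumptions: all pairwise correlations must reach the cutoff, and combined probesets are averaged. `collapse_probesets` and the `entrez|probeset` label format are hypothetical.

```python
from collections import defaultdict

import numpy as np

def collapse_probesets(expr, probeset_to_entrez, min_corr=0.8):
    """expr maps probeset ID -> expression values across samples."""
    by_gene = defaultdict(list)
    for probeset, entrez in probeset_to_entrez.items():
        if probeset in expr:
            by_gene[entrez].append(probeset)

    collapsed = {}
    for entrez, probesets in by_gene.items():
        profiles = np.array([expr[p] for p in probesets], dtype=float)
        if len(probesets) == 1:
            collapsed[entrez] = profiles[0]
            continue
        corr = np.corrcoef(profiles)
        off_diagonal = corr[~np.eye(len(probesets), dtype=bool)]
        if off_diagonal.min() >= min_corr:
            # Assumption: correlated probesets are combined by averaging.
            collapsed[entrez] = profiles.mean(axis=0)
        else:
            # Poorly correlated probesets stay separate.
            for probeset, profile in zip(probesets, profiles):
                collapsed[f"{entrez}|{probeset}"] = profile
    return collapsed

toy_expr = {
    "a": [1, 2, 3, 4], "b": [2, 4, 6, 8.5],   # strongly correlated
    "c": [4, 3, 2, 1], "d": [1, 2, 3, 5],     # anti-correlated
    "e": [5, 1, 4, 2],
}
toy_map = {"a": "10", "b": "10", "c": "20", "d": "20", "e": "30"}
toy_collapsed = collapse_probesets(toy_expr, toy_map)
```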
Pre-Processing Gene Expression Data
Pre-Processing Pipeline
= Implemented in R
= Implemented in Python
Big Picture
Glioma: A Deadly Brain Cancer
Wikimedia commons
Brain Anatomy
Wikimedia commons
What do they do?
Neurophysiology
Hierarchy of Nervous Tissue Tumors
Glioma
WHO Grade   Tumor Type                          Percentage of CNS Tumors
I           Pilocytic Astrocytoma
II          Diffuse or Low-Grade Astrocytoma    9.8%
III         Anaplastic Astrocytoma
IV          Glioblastoma Multiforme             20.3%
Gliomas account for 40% of all tumors and 78% of malignant tumors.
Buckner et al., 2007
Glioma Survival
http://www.neurooncology.ucla.edu/
5 years
10 years
Repository of Molecular Brain Neoplasia Data (REMBRANDT)
• REMBRANDT (Madhavan et al., 2009): currently 257 individual specimens
  - Glioblastoma multiforme (GBM) = 110
  - Astrocytoma = 50
  - Oligodendroglioma = 55
  - Mixed = 21
  - Non-Tumor = 21
Phenotypes
• Tumor type: GBM, Astrocytoma, etc.
• WHO Grade: 176 individuals
• Age: 253 individuals
• Sex: 250 individuals (partially inferred using Y chromosome genes)
• Survival (days post diagnosis): 169 individuals
REMBRANDT: Chromosome Y Expression
[Figure: sex-specific gene expression, female vs. male samples]
Males clustering with females should be more common than the reverse, because females lack a Y chromosome and are unlikely to express Y-linked genes.
• 4 females cluster with males
• 8 males cluster with females
REMBRANDT: Chr. Y Expression – Intelligent Reassignment
[Figure: sex-specific gene expression, female vs. male samples, after reassignment]
Intelligent reassignment: if the previously recorded sex places a sample in the other group, the call is turned into an NA. All samples of unknown sex are given a call.
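The reassignment rule can be written as a small function. This sketch assumes each sample already has an expression-based call ('M' or 'F') from the Y chromosome clustering; how that clustering was done is not specified here. `None` stands in for NA, and the name `reassign_sex` is hypothetical.

```python
def reassign_sex(recorded, inferred):
    """Apply the reassignment rule to one sample.

    recorded: 'M', 'F' or None (unknown) from the clinical annotation.
    inferred: 'M' or 'F' from chromosome Y expression (assumed given).
    """
    if recorded is None:
        return inferred   # unknowns are given the expression-based call
    if recorded != inferred:
        return None       # conflicting calls become NA
    return recorded

toy_recorded = ["M", "F", None, "F"]
toy_inferred = ["M", "M", "F", "F"]
toy_final = [reassign_sex(r, i) for r, i in zip(toy_recorded, toy_inferred)]
```

Setting conflicts to NA rather than overwriting them keeps possibly mislabeled samples out of sex-stratified analyses without asserting which source is wrong.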
Progression of Astrocytic Glioma
Furnari, et al. (2007)
Modeling Glioma
• Increasing metastatic potential and severity of glioma could be modeled using this simple schema
• Correlation of model to survival post diagnosis is -0.68
[Figure: progression schema coded as 0, 1, 2]
Exploring Meta-Information
• Age explains 31% of the variance in survival post diagnosis
• Age explains 25% of the variance in the progression model
• Sex does not have a significant effect on either survival or the progression model
  - Yet it is known that glioblastoma is slightly more common in men than in women
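A short sketch of how a correlation coefficient relates to "percent explained", assuming the percentages above are squared Pearson correlations (the slides do not state how they were computed). Under that assumption, the model-to-survival correlation of -0.68 corresponds to about 46% of variance. The function name and toy numbers are illustrative only.

```python
import numpy as np

def correlation_and_variance_explained(x, y):
    """Return Pearson r and r squared for two equal-length vectors."""
    r = np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1]
    return r, r ** 2

# Made-up toy values, not REMBRANDT data.
toy_r, toy_r2 = correlation_and_variance_explained([1, 2, 3, 4], [8, 6, 4, 2])

# Variance explained implied by a correlation of -0.68.
implied_r2 = (-0.68) ** 2
```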
Summary
• Large dataset with a good amount of meta-information
• Ready for dimensionality reduction and network inference!
Big Picture
Clustering as Dimensionality Reduction
Big Picture
Likely Issues
• Size of eukaryotic genomes
• Added complexity of regulatory regions
• Tissue and cell type heterogeneity
• Patient genetic and environmental heterogeneity
Relative Genome Sizes
Solutions
• Pre-process genomic sequences
• Reduce data complexity by collapsing redundancies
• Utilize filters that select for only the most variant genes
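One way to select the most variant genes is to rank them by variance across samples and keep the top N. This is a sketch under that assumption; the slides do not name the statistic or the cutoff used, and `most_variable_genes` is a hypothetical helper.

```python
import numpy as np

def most_variable_genes(expr, gene_ids, top_n):
    """Return the IDs of the top_n genes ranked by sample variance.

    expr is a genes x samples matrix; gene_ids labels its rows.
    """
    expr = np.asarray(expr, dtype=float)
    variances = expr.var(axis=1, ddof=1)
    order = np.argsort(variances)[::-1][:top_n]   # highest variance first
    return [gene_ids[i] for i in order]

toy_matrix = [
    [1, 1, 1, 1],     # no variance
    [0, 10, 0, 10],   # high variance
    [1, 2, 1, 2],     # low variance
]
toy_top = most_variable_genes(toy_matrix, ["g1", "g2", "g3"], top_n=2)
```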
Likely Issues
• Size of eukaryotic genomes
• Added complexity of regulatory regions
• Tissue and cell type heterogeneity
• Patient genetic and environmental heterogeneity
Eukaryotic Gene Structure
Eukaryotic Gene Structure
Transcriptional Start Site
Start Codon
Untranslated Regions
Eukaryotic Gene Structure
Exons
Eukaryotic Gene Structure
Introns
Regulatory Regions
3’ UTR
• miRNA binding sites (4-9bp motifs)
• Median 3’ UTR length is 831bp
Promoter
• Transcription factor binding sites (6-12bp motifs)
• No set length for promoters in eukaryotes
• Grabbing 2Kbp, so we can use 2Kbp or smaller
Three Examples After Capture
85% (n = 36,177) of probesets are associated with a sequence
Solution
• Do motif detection on both promoter and 3’ UTR sequences
• Incorporate both of these regulatory regions into the cMonkey bi-cluster scoring matrix
Promoter Sequences
• Looking for transcription factor binding sites (TFBS)
  - Using MEME with 6-12bp motif widths
• Utilized RefSeq gene mapping to identify putative promoter regions
  - 2Kbp of sequence upstream of the transcriptional start site (TSS) was grabbed
• If two RefSeq gene mappings did not overlap then the longest transcript’s promoter was taken
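Grabbing sequence upstream of a TSS has to respect strand: on the minus strand, "upstream" lies at higher genomic coordinates and the sequence must be reverse-complemented. The sketch below assumes 0-based coordinates, with `tss` as the index of the first transcribed base, and a chromosome held as a Python string. The function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def promoter_sequence(chrom_seq, tss, strand, upstream=2000):
    """Return up to `upstream` bases 5' of the TSS, read 5'->3' on the gene's strand."""
    if strand == "+":
        start, end = max(0, tss - upstream), tss
        return chrom_seq[start:end]
    if strand == "-":
        start = tss + 1
        end = min(len(chrom_seq), tss + 1 + upstream)
        return reverse_complement(chrom_seq[start:end])
    raise ValueError("strand must be '+' or '-'")

toy_chrom = "AACCGGTTACGT"
plus_promoter = promoter_sequence(toy_chrom, tss=6, strand="+", upstream=4)
minus_promoter = promoter_sequence(toy_chrom, tss=5, strand="-", upstream=4)
```

Because the returned sequence always ends at the TSS, a shorter promoter (for example 500bp) can be taken from a 2Kbp capture by slicing the last 500 characters.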
3’ UTR Sequences
• Looking for miRNA binding sites
  - miRNAs are roughly 21-nucleotide RNA molecules that bind to mRNA and alter expression
  - Using MEME with 4-9bp motif widths
Likely Issues
• Size of eukaryotic genomes
• Added complexity of regulatory regions
• Tissue and cell type heterogeneity
• Patient genetic and environmental heterogeneity
Complexity of Mammalian Systems
Cellular Heterogeneity in Tissues
What Makes Cells Different?
Solution
• Filter out genes that are not expressed for each tissue, leaving only those that are expressed
• Enhance the capability of the software to handle missing data
Likely Issues
• Size of eukaryotic genomes
• Added complexity of regulatory regions
• Tissue and cell type heterogeneity
• Patient genetic and environmental heterogeneity
Intelligent Sample Collection
• Genetic and environmental heterogeneity are real world issues
• Can try to match for certain confounders
• Or stratify analyses based on particular confounders
Running cMonkey
• Running cMonkey on the AEGIR cluster
  - 10 nodes with 8 cores per node
  - 1 node has 24GB RAM
  - 2 others have 16GB RAM
• Completion time depends heavily on the size of the run
Beautiful New Result Interface
Looking at a Cluster
Chris’s Graphics Mods
Original cMonkey Output
Sorted cMonkey Output
Boxplot For All Samples
Boxplot for In Samples
Integrating Phenotypes
What to do when you find a cluster?
Checking Out PSSM #1
Known Motif?
Motif Known?
What do the genes do?
Functional Enrichment?
Functional Enrichment
Genes?
Interesting Cluster
Phenotype Correlations
• Survival: correlation coefficient = -0.48, p-value = 3.2 × 10^-11
• Progression Model: correlation coefficient = 0.55, p-value = 6.7 × 10^-16
• Age: correlation coefficient = 0.32, p-value = 2.2 × 10^-7
• Sex: correlation coefficient = -0.27, p-value = 0.0012
Bonferroni-corrected significance threshold: p ≤ 0.05 / (585 × 4) ≈ 2.1 × 10^-5
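The Bonferroni arithmetic can be checked directly. This sketch assumes 585 is the number of clusters tested and 4 the number of phenotypes, which is how the denominator reads; the p-values are the ones listed for this cluster.

```python
alpha = 0.05
n_clusters = 585      # assumption: the 585 in the denominator counts clusters
n_phenotypes = 4      # survival, progression model, age, sex

# Bonferroni: divide alpha by the total number of tests.
threshold = alpha / (n_clusters * n_phenotypes)

p_values = {
    "Survival": 3.2e-11,
    "Progression Model": 6.7e-16,
    "Age": 2.2e-7,
    "Sex": 0.0012,
}
significant = {name: p <= threshold for name, p in p_values.items()}
```

With this threshold, survival, the progression model and age pass for this cluster, while sex (p = 0.0012) does not.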
Genes from Cluster
AFFY_ID        Gene Symbol   Gene Name
212067_S_AT    C1R           complement component 1, r subcomponent
208747_S_AT    C1S           complement component 1, s subcomponent
201743_AT      CD14          CD14 antigen
215049_X_AT    CD163         CD163 antigen
203854_AT      CFI           complement factor I
213060_S_AT    CHI3L2        chitinase 3-like 2
208146_S_AT    CPVL          carboxypeptidase, vitellogenic-like
201798_S_AT    FER1L3        fer-1-like 3, myoferlin (C. elegans)
206584_AT      LY96          lymphocyte antigen 96
202180_S_AT    MVP           major vault protein
204150_AT      STAB1         stabilin 1
204924_AT      TLR2          toll-like receptor 2
= Previously known to be differentially expressed in GBM.
Motif Matches
PSSM #2
PSSM #1
Summary
• Very promising results
• Need to further develop certain aspects of cMonkey to better utilize the human data
• Then need to build network inference component
General Questions
• Biclustering or not?
• How many genes to run?
• How much sequence to feed MEME?
• Can more than one experiment be included?
Cluster Samples, or Not?
• Bi-clustering clusters not only genes but also experimental conditions (samples)
• Because we are using just one experiment it may not be necessary to cluster samples
• Although it may be useful again once other experiments are included
Bi-clustering or Not?
Bi-clustering Gene Clustering Only
Brief Glance
• Looks like for this dataset it may make more sense to only cluster genes
  - More clusters with significant motifs
• Although this is likely to change once we add more experiments to the mix
• Need a method to quantify this
General Questions
• Biclustering or not?
• How many genes to run?
• How much sequence to feed MEME?
• Can more than one experiment be included?
Maxing Out cMonkey
• Can cMonkey handle running all genes?
  - Yes, without doing motif finding
  - With motif finding this will take a long time (weeks?) and tends to crash out
• Essentially need to balance sequence length for motif finding with cluster size and number of clusters
• Need a method to quantify this
General Questions
• Biclustering or not?
• How many genes to run?
• How much sequence to feed MEME?
• Can more than one experiment be included?
Length for Promoters?
• MEME suggests 1Kbp or less for sequences as input
• Tried using 500bp, 1Kbp, 2Kbp, 2.5Kbp, and 5Kbp
Brief Glance
• So far it looks like the 500bp promoters give the most clusters with motifs
• Need a method to quantify this
General Questions
• Biclustering or not?
• How many genes to run?
• How much sequence to feed MEME?
• Can more than one experiment be included?
Breast Cancer Metastasis
Bos et al., 2009
cMonkey for Eukaryotes
Future Modifications to cMonkey for eukaryotes:
Preprocess sequence data
Add 3’ UTR miRNA motif detection
Integrate 3’ UTR miRNA motif scores with promoter motif scores
Network Inference
• cMonkey software is utilized to produce the bi-clusters
• Inferelator can then be used to identify regulatory factors
• Simple correlation with phenotypes can relate bi-clusters to disease
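Relating a bi-cluster to a phenotype by simple correlation can be sketched as below. The example summarizes a cluster by the median expression of its genes in each sample and computes a Pearson correlation against the phenotype, skipping samples with missing phenotype values. The median summary and the function name are assumptions, not a description of cMonkey or Inferelator output.

```python
import numpy as np

def cluster_phenotype_correlation(cluster_expr, phenotype):
    """Pearson r between a cluster's median profile and a phenotype.

    cluster_expr: genes x samples matrix for the genes in one cluster.
    phenotype: one value per sample; NaN marks a missing value.
    """
    cluster_expr = np.asarray(cluster_expr, dtype=float)
    phenotype = np.asarray(phenotype, dtype=float)
    profile = np.median(cluster_expr, axis=0)   # one value per sample
    keep = ~np.isnan(phenotype)                 # drop samples without phenotype
    return np.corrcoef(profile[keep], phenotype[keep])[0, 1]

# Made-up toy values: higher cluster expression, shorter survival.
toy_cluster = [[1, 2, 3, 4], [2, 3, 4, 5]]
toy_survival = [40, 30, float("nan"), 10]
toy_cluster_r = cluster_phenotype_correlation(toy_cluster, toy_survival)
```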
Acknowledgements
Baliga Lab
• Nitin
• David
• Chris
• Dan
Hood Lab
• Burak Kutlu
• Luxembourg Project
• REMBRANDT