Unlocking the potential of public available gene expression data for large-scale analysis
Jonatan Taminau
PhD defense, November 2012
Introduction
• In this thesis:
•Focus on the data-to-information step.
•Focus on microarray technology.
Data → Information → Knowledge
Introduction
Data → Information
Data Repositories:
+ Massive amounts
+ Examples: GEO, ArrayExpress
+ Publicly available!
Analysis Software:
+ Commercial: CLC Bio, Spotfire, etc.
+ Free: Bioconductor, GenePattern, Galaxy, etc.
+ A lot of existing research
Introduction
“Although hundreds of thousands of samples are publicly available, and several powerful analysis software solutions exist, the research community is facing a chasm between these two resources.” (Coletta et al, 2012)
“One of the challenges for the future is how to integrate all the DNA microarray data that have been generated and deposited in public databases.” (Larsson et al, 2006)
Introduction
• We identified two hurdles for large-scale microarray analysis:
① Consistent retrieval of individual datasets.
② Integrative analysis of multiple data sets.
Outline
Chapter 1
Chapter 2
Chapter 3
Chapter 4
Chapter 5
Chapter 6
Chapter 7
Chapter 8
Chapter 9
Outline
Retrieval of data
+ Problem Statement
+ inSilico DB
Integrative Analysis
+ Problem Statement
+ Meta-Analysis
+ Merging
+ Application
Retrieval of genomic data
•Data is online, freely available
•But: difficult to consistently retrieve the data (Example: Baggerly & Coombes, 2011)
•What does consistent retrieval mean?
•Data retrieval is reproducible and tractable
•No manual intervention needed
•All data is preprocessed the same way
Retrieval of genomic data
•Typical microarray workflow:
DNA microarray → Scanner → Image → Image Analysis → CEL file (numerical 'raw' data) → Preprocessing → Gene expression matrix
Retrieval of genomic data
CEL file (numerical 'raw' data) → Preprocessing → Gene expression matrix
Preprocessing is complex:
+ normalization/background correction
+ probe-to-gene mapping
+ versioning issues
+ etc.
Not documented!
“only 48% of all data in GEO and ArrayExpress was submitted with raw data” (Larsson et al. 2006)
Retrieval of genomic data
Features:
+ Genes or probes
+ range: 20k-30k
Instances:
+ Patients, tissues, etc.
+ range: 10-100
Gene Expression Value $x_{ij}$:
+ Expression of gene i in sample j
+ range between 2 and 14
+ log2 scaled
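As a rough illustration of the matrix described above, the sketch below builds a tiny gene-by-sample table and log2-scales it. The gene names, sample names, and intensities are made up for the example, and storing genes as rows is an assumption.

```python
import math

# Hypothetical raw intensities: one row per gene (feature), one column per sample (instance).
genes = ["geneA", "geneB", "geneC"]
samples = ["s1", "s2"]
raw = [
    [16.0, 64.0],
    [1024.0, 512.0],
    [4096.0, 16384.0],
]

# log2-scale every value, giving x[i][j] = expression of gene i in sample j.
x = [[math.log2(value) for value in row] for row in raw]


def expression(gene, sample):
    """Look up x_ij by gene name and sample name."""
    return x[genes.index(gene)][samples.index(sample)]
```

With these toy numbers all log2 values fall inside the 2 to 14 range quoted on the slide.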
Retrieval of genomic data
•What about phenotypic data or metadata?
•Extra information about the samples (age, gender, disease, etc.)
•No standard way of formatting this information
•MIAME / Ontologies / Free text / etc.
•Also still an open problem
Retrieval of genomic data
•Why is consistent retrieval from public repositories so important?
•Reproducibility of results
•Comparison of new results with existing studies
•Combining different studies
The inSilico Database
•Result of the InSilico project
•Innoviris (2007-2012)
•8 people from VUB & ULB
•Provides consistently preprocessed and expert-curated genomic data
•Being commercialized
The inSilico Database
•What makes the inSilico Database so valuable?
•Not the fact that all data is precomputed
•But how it is precomputed
•What is the underlying engine?
•Genomic Pipelines
•Backbone
The inSilico DB | Genomic Pipelines
•For every data type there is a different pipeline
•Microarray pipeline:
• Jobs
• Dependencies
• Backbone
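A pipeline made of jobs with dependencies can be pictured as a directed acyclic graph that is executed in dependency order. The sketch below is only an illustration of that idea: the job names are hypothetical and nothing here reflects the actual inSilico DB backbone implementation.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each job maps to the set of jobs it depends on.
dependencies = {
    "download_raw": set(),
    "preprocess": {"download_raw"},
    "map_probes_to_genes": {"preprocess"},
    "curate_metadata": {"download_raw"},
    "publish_dataset": {"map_probes_to_genes", "curate_metadata"},
}


def execution_order(deps):
    """Return one valid order in which the jobs can run."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
```

A workflow engine built this way can rerun only the jobs downstream of a failed or updated step, which fits the "control of intermediate results" idea on the backbone slides.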
The inSilico DB | Backbone
•Automatic Workflow System
•Barely any manual intervention needed
•Control of intermediate results
•Pre-computation saves time (for the user)
•Streamlined Error management
•Automatic Monitoring
The inSilico DB | Backbone
•How does it work?
• Java daemon (recently replaced by application server)
•Configuration Files
inSilicoDb package
•One thing missing for large-scale analysis...
•Programmatic access via scripting
•Contains the basic functionality of InSilico DB
•Makes automatic retrieval of data possible!
•Seamlessly integrates with other Bioconductor analysis tools
•Published in Bioinformatics, downloaded more than 2000 times
Integrative Analysis
•“Combining the information of multiple, independent but related studies in order to extract more general and more reliable results”
•Problem:
•How to do it?
•Two approaches:
•Meta-Analysis
•Merging
Integrative Analysis
[Figure: schematic of the two approaches, Meta-Analysis and Merging]
Meta-Analysis
+ Combining p-values + Combining effect sizes + Combining Ranks + Vote Counting + etc.
+ Depends on goal + Much focus on finding DEGs + Defines what the results look like
+ Consistent Retrieval is essential! + inSilicoDb package
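As one concrete instance of "combining p-values", the sketch below uses Fisher's method. The slides do not say which combination rule was used, so this choice is an assumption made purely for illustration. The p-values are assumed to be independent and strictly positive.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values for one gene with Fisher's method."""
    k = len(p_values)
    # Test statistic: -2 * sum(ln p), chi-square distributed with 2k degrees of freedom.
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    # Closed-form chi-square survival function, valid for an even number of degrees of freedom.
    tail = sum(half ** i / math.factorial(i) for i in range(k))
    return math.exp(-half) * tail
```

For example, `fisher_combine([0.04, 0.03, 0.20])` gives a single combined p-value for a gene that was tested in three studies.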
Meta-Analysis | Stable Genes
•365 studies were screened for stable genes
•Motivation:
• Interested in reference genes
•Currently used genes (housekeeping genes) are not ideal
•Need a compact and diverse list of genes that are stable under most conditions
• In collaboration with Dr Bram de Craene (VIB-UGent)
Meta-Analysis | Stable Genes
(1) Retrieve Data
+ inSilicoDb package
+ All 365 datasets downloaded in less than 100 min
(2) Calculate Stability Scores
+ For each gene: Coefficient of Variation (CV) = sd / mean
+ Avoid lowly expressed genes
(3) Combine Stability Scores
+ For each gene take median of CVs
+ Rank and take top 100
(4) Semantic Similarity Filtering
+ Exclude genes that are related
+ Uses gene annotation from GO
+ Innovative Step!
+ From 100 to 10 genes
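Steps (2) and (3) above can be sketched in a few lines. This is a minimal illustration, not the code used in the thesis: the data layout (one dict per study), the `min_mean` cutoff for lowly expressed genes, and the use of the sample standard deviation are all assumptions. Step (4), the GO-based semantic similarity filter, is not covered here.

```python
import statistics


def coefficient_of_variation(values):
    """CV = standard deviation / mean (sample standard deviation assumed)."""
    return statistics.stdev(values) / statistics.mean(values)


def rank_stable_genes(studies, min_mean=0.0, top=100):
    """studies: list of dicts mapping gene -> expression values within one study."""
    cvs = {}
    for study in studies:
        for gene, values in study.items():
            # Skip lowly expressed genes; the threshold is a hypothetical parameter.
            if statistics.mean(values) < min_mean:
                continue
            cvs.setdefault(gene, []).append(coefficient_of_variation(values))
    # Combine the per-study scores with the median; a low CV means a stable gene.
    combined = {gene: statistics.median(scores) for gene, scores in cvs.items()}
    return sorted(combined, key=combined.get)[:top]
```

Using the median rather than the mean keeps a single atypical study from dominating a gene's combined score.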
Meta-Analysis | Stable Genes
•Status:
•August 2012 | waiting for results…
•September 2012 | first positive results!
•November 2012 | second test case, positive feedback from NAR, manuscript in preparation…
Merging
+ Consistent Retrieval is essential! + inSilicoDb package
+ Batch effects
+ Methods to remove
  - Location-scale
  - Matrix Factorization
  - Discretization
+ Makes data compatible
+ Preprocessing not sufficient
+ Same as with single studies + Increased sample size !
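To make the "location-scale" family concrete, the sketch below standardizes each gene within each study (mean 0, unit variance) before concatenating the samples. This is one simple representative of that family, chosen for illustration; it is not claimed to be any specific method from the inSilicoMerging package. It assumes every study measures the same genes.

```python
import statistics


def standardize_batch(batch):
    """batch: dict mapping gene -> expression values for the samples of one study."""
    adjusted = {}
    for gene, values in batch.items():
        mean = statistics.mean(values)
        sd = statistics.pstdev(values)
        # Shift to mean 0 (location) and rescale to unit variance (scale).
        adjusted[gene] = [(v - mean) / sd if sd > 0 else 0.0 for v in values]
    return adjusted


def merge_batches(batches):
    """Standardize each batch separately, then concatenate the samples gene by gene."""
    merged = {}
    for batch in batches:
        for gene, values in standardize_batch(batch).items():
            merged.setdefault(gene, []).extend(values)
    return merged
```

After this adjustment, a gene measured around 3 in one study and around 12 in another ends up on the same scale, so study membership no longer dominates the merged data.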
Merging | Batch Effects
•Illustrative example of what batch effects can cause:
•We merged 4 different studies with thyroid samples
•All studies contained normal and tumor samples
• In collaboration with Wilma Van Staveren (IRIBHM, ULB)
•Samples are plotted in MDS space
•We expect two clusters (normal and tumor)
Merging | Batch Effects
[Figure: two MDS plots, "Merging without batch effect removal" and "Merging with batch effect removal"]
Legend:
+ symbol for study
+ color for normal/tumor
inSilicoMerging package
•R/Bioconductor package combining:
•6 different merging methods
•5 visual inspection tools
•6 quantitative measures
•Only resource so far combining all this functionality!
•Seamlessly integrates with inSilicoDb package
Identification of DEGs in Lung Cancer
• Idea: compare meta-analysis and merging approaches for integrative analysis
•We chose lung cancer as the case study, based on the content of inSilico DB.
• Ignore subtypes: DEGs can be seen as playing a role in the basic mechanisms of lung cancer
Identification of DEGs in Lung Cancer
•What is our hypothesis?
•Due to the small sample sizes of individual studies there are a lot of false negatives when using meta-analysis
•Can we avoid this by using merging as an alternative approach?
Identification of DEGs in Lung Cancer
Constraints:
+ fRMA preprocessed
+ > 30 samples
+ both normal and tumor
+ GPL96 or GPL570
Methodology:
+ apply limma
  - p-value < 0.05
  - FC > 2
+ robustness test
  - 100 iterations with 90% of data
  - resampling
Merging:
+ inSilicoMerging package
Meta-Analysis:
+ take intersection
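The robustness test and the intersection step can be sketched as follows. Several things here are assumptions for illustration: limma's moderated t-test is replaced by a plain fold-change filter (a difference of more than 1 on the log2 scale, i.e. FC > 2), a gene counts as robust only if it is called in every iteration, and all function names are hypothetical.

```python
import random
import statistics


def call_degs(normal, tumor, min_log2_fc=1.0):
    """Stand-in DEG caller: |difference of mean log2 expression| > 1, i.e. FC > 2.

    normal, tumor: dicts mapping gene -> list of log2 expression values.
    """
    return {
        gene for gene in normal
        if abs(statistics.mean(tumor[gene]) - statistics.mean(normal[gene])) > min_log2_fc
    }


def subsample(group, indices):
    """Keep only the samples at the given positions, for every gene."""
    return {gene: [values[i] for i in indices] for gene, values in group.items()}


def robust_degs(normal, tumor, iterations=100, fraction=0.9, seed=0):
    """Keep only genes that are called in every resampling iteration."""
    rng = random.Random(seed)
    n_normal = len(next(iter(normal.values())))
    n_tumor = len(next(iter(tumor.values())))
    robust = None
    for _ in range(iterations):
        idx_n = rng.sample(range(n_normal), max(1, int(fraction * n_normal)))
        idx_t = rng.sample(range(n_tumor), max(1, int(fraction * n_tumor)))
        degs = call_degs(subsample(normal, idx_n), subsample(tumor, idx_t))
        robust = degs if robust is None else robust & degs
    return robust


def meta_analysis_degs(studies):
    """Meta-analysis branch: intersect the robust DEG sets of the individual studies.

    studies: non-empty list of (normal, tumor) pairs.
    """
    return set.intersection(*(robust_degs(normal, tumor) for normal, tumor in studies))
```

For the merging branch, the same `robust_degs` call would be applied once to the single merged, batch-adjusted dataset instead of once per study.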
Identification of DEGs in Lung Cancer
• Meta-Analysis:
Identification of DEGs in Lung Cancer
• Merging:
Identification of DEGs in Lung Cancer
• Findings:
• Resampling helps to remove false positives
• Relatively low impact of batch effect removal methods
• More DEGs identified through merging (102) than via meta-analysis (25)
“Deriving separate statistics and then averaging is often less powerful than directly computing statistics from aggregated data.” (Xu et al, 2008)
No false positives?
+ checked literature
+ initial pathway analysis
Outline
+ Contributions
+ Conclusions
Contributions
•Genomic pipelines / backbone (Ch 4)
•Release of 2 publicly available R/Bioconductor packages (Ch 4 & 7)
•Survey of batch effect removal methods (Ch 7)
•Two applications
• Identification of stable genes via meta-analysis (Ch 6)
•Screening of potential biomarkers via integrative analysis (Ch 8)
Conclusions
• We identified two hurdles for large-scale microarray analysis:
① Consistent retrieval of individual datasets.
② Integration of multiple data sets for integrative analysis.
Conclusions
① Consistent retrieval of individual datasets → inSilicoDb package
② Integration of multiple data sets for integrative analysis → inSilicoMerging package
Paving the road towards unlocking the potential of public available gene expression studies
Thanks!
+ InSilico Team!
+ Jury!
+ Audience!
+ Yann-Michaël!