TRANSCRIPT
Driving Discovery Through Data Integration and Analysis
John Quackenbush
Molecular Diagnostics World 2010
28 October 2010
Treatment Options
Quality of Life
Genetic Risk
Early Detection
Patient Stratification
Disease Staging
Outcomes
Natural History of Disease    Clinical Care
Environment + Lifestyle
Birth    Treatment    Death
Disease Progression and Personalized Care
Biomarkers
Assure access to samples and rational consent
Develop a technology platform
Make information integration a central mission
Conduct research as a vital component
Present data and information to the local community
Enable research beyond your own
Engage corporate partners
Communicating the mission to the community.
Turning the vision into a reality
Patients want to be part of the process of curing disease
Informed consent needs to be structured to allow patients to be partners in the research process
HIPAA requires both informed consent and that we assure patient confidentiality
But “identifiability” is a moving target in a genomic age
With the <$1000 genome, in the age of Facebook, what this means remains unclear
The new Genomics is a disruptive technology.
Access, Research, Security
2006: State of the Art Sequencing
SEQUENCING
74x Capillary Sequencers
10 FTEs
15-40 runs per day
1-2 Mb per instrument per day
120 Mb total capacity per day
PRODUCTION
Rooms of equipment
Subcloning > picking > prepping
35 FTEs
3-4 weeks
Sequencing the genome took ~15 years and $3B
2008: Enabling a New Era in Genome Analysis
PRODUCTION
1x Cluster Station
1 FTE
1 day
SEQUENCING
1x Genome Analyzer
Same FTE as above
1 run per 5 days
15 Gb per instrument per run
>3 Gb per day (1x genome coverage)
We can now re-sequence the genome in ~1 week
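The throughput figures on the 2006 and 2008 slides can be compared with a few lines of arithmetic. This is a minimal sketch that uses only the numbers quoted above; the decimal unit conversion (1 Gb = 1000 Mb) is an assumption.

```python
# Figures quoted on the 2006 and 2008 slides; everything else is arithmetic.
MB_PER_GB = 1000  # assumption: decimal units (1 Gb = 1000 Mb)

capillary_total_mb_per_day = 120  # 2006: all 74 capillary sequencers combined
analyzer_gb_per_run = 15          # 2008: one Genome Analyzer
analyzer_days_per_run = 5

# Daily output of a single 2008 instrument, in Mb.
analyzer_mb_per_day = analyzer_gb_per_run * MB_PER_GB / analyzer_days_per_run

# How many 2006-era facilities one 2008 instrument replaces.
speedup = analyzer_mb_per_day / capillary_total_mb_per_day

print(f"Genome Analyzer: {analyzer_mb_per_day:.0f} Mb/day")
print(f"One 2008 instrument vs. the whole 2006 facility: {speedup:.0f}x")
```

Under these numbers a single instrument run by one person produces roughly 25 times the daily output of the 74-instrument facility.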
The Challenge
New technologies inspired by the Human Genome Project are transforming biomedical research from a laboratory science to an information science
We need new approaches to making sense of the data we generate
The winners in the race to understand disease are going to be those best able to collect, manage, analyze, and interpret the data.
Gene    Protein    RNA
Network
http://compbio.dfci.harvard.edu
Gene Index Databases
Resourcerer
Other Databases
TM4 Microarray Software
Other tools
MeSHer
ClusterMed
Bayesian Nets
DNA Microarray Analysis
Candidate Gene(s)
Perturb Network (RNAi)
Assay Response (A)
Predict Network
Patient
DNA Microarray Analysis
Central Warehouse
Other Things:
Mesoscopic Expression
Correlated Signatures
State Space Gene Models
Tiling Arrays to Genes
Clinical Data
Metabolomics
Proteomics
Transcriptomics
Cytogenomics
Epigenomics
Genomics
Published Datasets
DrugBank
The HapMap
The Genome
Disease Databases (OMIM)
PubMed
Clinical Trials
Chemical Biology
Etc.
Beating Information Overload
Central Warehouse
Improved Diagnostics
Individualized Therapies
More Effective Agents
[Architecture diagram. Labels recovered from the figure:]
Portals; Web Center Portal; Custom
Business Intelligence; Facts (A, B, C, D)
Build or Buy; Oracle; Existing
Enterprise Service Bus; Rules Engine; BPEL
genomics; HTB ODS
De-identification; Mapping; Terminology; Security; EMPI; Auditing
IDX; Rx; Lab; Clinical Trial; …
Dana Farber Clinical Systems
BAM Dashboard; OMICS
Dana Farber Lab
External Partners; External misc
Clinical Pathways; Web Service Directory; Idm & Security; Severity Score; …; RFID
PubMed; GenBank
Dana-Farber Research DB Conceptual Architecture
An Example: Signature Analysis
Aedin Culhane, Thomas Schwarzl, Joe White, Fenglong Liu, Kerm Picard
ArrayExpress
GEO
Random Websites
Fenglong Liu
Warehouse
GeneChip Oncology Database
Fenglong Liu
Soon to be replaced by EBI’s Gene Expression Atlas
Analysis
An Example: Signature Analysis
Aedin Culhane, Thomas Schwarzl, Joe White, Fenglong Liu, Kerm Picard
PubMed
ArrayExpress
GEO
Random Websites
Fenglong Liu
Kerm Picard
Warehouse
In-House Studies
GeneSigDB – release 2
http://compbio.dfci.harvard.edu/genesigdb
Breast Cancer has unique signatures
Aedin Culhane, Daniel Gusenleitner
A sample research question
How many Multiple Myeloma patients, with bone marrow or blood samples in the bank, and who have a chromosome 13 deletion, responded (complete, partial, or minor remission) to therapy and how many did not respond?
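A question like this becomes a simple filter-and-count once clinical, sample-bank, and cytogenetic data sit in one place. The sketch below assumes a hypothetical flat record per patient; the field names (`dx`, `samples`, `del13`, `response`) and the records themselves are invented for illustration and are not the warehouse schema.

```python
from collections import Counter

# Hypothetical patient records; field names and values are illustrative only.
patients = [
    {"id": 1, "dx": "Multiple Myeloma", "samples": {"bone marrow"}, "del13": True, "response": "complete"},
    {"id": 2, "dx": "Multiple Myeloma", "samples": {"blood"}, "del13": True, "response": "none"},
    {"id": 3, "dx": "Multiple Myeloma", "samples": {"bone marrow"}, "del13": False, "response": "partial"},
    {"id": 4, "dx": "Multiple Myeloma", "samples": set(), "del13": True, "response": "minor"},
    {"id": 5, "dx": "Lymphoma", "samples": {"blood"}, "del13": True, "response": "partial"},
    {"id": 6, "dx": "Multiple Myeloma", "samples": {"blood", "bone marrow"}, "del13": True, "response": "minor"},
]

BANKED_TYPES = {"bone marrow", "blood"}
RESPONSES = {"complete", "partial", "minor"}  # remission categories named in the question

def count_responders(records):
    """Count responders and non-responders among eligible patients."""
    counts = Counter()
    for p in records:
        if p["dx"] != "Multiple Myeloma":
            continue  # wrong diagnosis
        if not (p["samples"] & BANKED_TYPES):
            continue  # no bone marrow or blood sample in the bank
        if not p["del13"]:
            continue  # no chromosome 13 deletion
        key = "responded" if p["response"] in RESPONSES else "did not respond"
        counts[key] += 1
    return counts

result = count_responders(patients)
print(dict(result))
```

The hard part in practice is not the query but getting diagnosis, sample inventory, cytogenetics, and outcome into one consistent record, which is the case the talk makes for a central warehouse.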
A Path Forward
We are working to develop a two-way strategy for the future
Clinic → Lab
Lab → Clinic
Consider OncotypeDx
This approach represents the intellectual framework for future success – and the bridges between the various laboratories and programs.
Bayesian Networks
Amira Djebbari
Raktim Sinha
Dan Schlauch
When we say “Networks” we mean…
Genes are represented as “nodes”
Interactions are represented by “edges”
Edges can be directed to show “causal” interactions
Edges are not necessarily direct interactions
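The node-and-edge vocabulary above maps directly onto a small data structure. This is a minimal sketch with made-up gene names and edges, storing each directed edge as a (source, target) pair and deriving each node's parents.

```python
# Gene names and edges are illustrative, not taken from the talk's data.
edges = [("Gene1", "Gene2"), ("Gene1", "Gene3"), ("Gene3", "Gene4")]

def parents_of(edge_list):
    """Map every node to the set of nodes with a directed edge into it."""
    parents = {}
    for src, dst in edge_list:
        parents.setdefault(src, set())
        parents.setdefault(dst, set()).add(src)
    return parents

parents = parents_of(edges)
for gene in sorted(parents):
    print(gene, "<-", sorted(parents[gene]))
```

An edge here records a statistical dependency, so Gene1 -> Gene3 does not claim that the two gene products physically interact.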
Bayesian network - example
Conditional probability table at node “Gene2”
Gene1    P(Gene2=1|Gene1)
-1       0.1
0        0.2
1        0.7
Edges represent dependencies
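The table at node “Gene2” can be used in both directions: to predict Gene2 from Gene1, and to update beliefs about Gene1 after observing Gene2. The sketch below uses the three probabilities from the slide; the uniform prior over Gene1's states is an assumption made only for illustration.

```python
# P(Gene2 = 1 | Gene1) from the slide's table.
cpt_gene2 = {-1: 0.1, 0: 0.2, 1: 0.7}

# Assumption for illustration only: Gene1 is equally likely to be -1, 0, or 1.
prior_gene1 = {-1: 1 / 3, 0: 1 / 3, 1: 1 / 3}

# Marginal probability that Gene2 = 1, summing over Gene1's states.
p_gene2 = sum(prior_gene1[s] * cpt_gene2[s] for s in cpt_gene2)

# Bayes' rule: P(Gene1 = s | Gene2 = 1).
posterior_gene1 = {s: prior_gene1[s] * cpt_gene2[s] / p_gene2 for s in cpt_gene2}

print(f"P(Gene2=1) = {p_gene2:.3f}")
for state, prob in posterior_gene1.items():
    print(f"P(Gene1={state} | Gene2=1) = {prob:.2f}")
```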
Learning Bayesian networks:
Structure
Conditional probability tables
Gene1
Gene4
Gene3    Gene2
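Of the two learning tasks, the conditional probability tables are the easier one: with the structure fixed, a table can be estimated by counting. The sketch below shows plain maximum-likelihood counting on hypothetical discretized observations; the talk does not say which estimator was used, so treat this as one simple option.

```python
from collections import Counter

# Hypothetical discretized observations: (Gene1 state, Gene2 state).
observations = [(-1, 0), (-1, 0), (-1, 1), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0)]

def estimate_cpt(pairs, child_value=1):
    """Estimate P(child = child_value | parent state) by relative frequency."""
    totals, hits = Counter(), Counter()
    for parent_state, child_state in pairs:
        totals[parent_state] += 1
        if child_state == child_value:
            hits[parent_state] += 1
    return {state: hits[state] / totals[state] for state in totals}

cpt = estimate_cpt(observations)
for state in sorted(cpt):
    print(f"P(Gene2=1 | Gene1={state}) = {cpt[state]:.2f}")
```

Learning the structure is the harder search problem, which is where the priors discussed next come in.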
Bayesian networks - priors
No free lunch theorem (Wolpert & Macready, 1996):
The performance of a general-purpose optimization algorithm iterated on a cost function is independent of the algorithm when averaged over all cost functions.
Suggests that when considering a specific application one can introduce a potentially useful bias using domain knowledge
A low-cost lunch?
One can “help” the search along by providing a seed structure representing what we believe is the most likely network
The network search process will then use gene expression data to look for perturbations on the structure that are supported by the data
There are many possible sources of prior structures including the Biomedical literature and large-scale interaction studies (PPI)
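One way to make "perturbations on the structure that are supported by the data" concrete is to score each candidate network by its data fit plus a penalty for every edge that differs from the seed. The talk does not state the form of its prior; the per-edge penalty below (a factor `kappa` for each differing edge) is one common choice, and the log-likelihood values are invented for illustration.

```python
import math

def log_structure_prior(candidate_edges, seed_edges, kappa=0.5):
    """Log prior up to a constant: each edge differing from the seed costs log(kappa)."""
    differences = len(set(candidate_edges) ^ set(seed_edges))
    return differences * math.log(kappa)

def log_posterior_score(log_likelihood, candidate_edges, seed_edges, kappa=0.5):
    """Data fit plus structure prior; higher is better."""
    return log_likelihood + log_structure_prior(candidate_edges, seed_edges, kappa)

seed = {("Gene1", "Gene2"), ("Gene2", "Gene3")}
perturbed = seed | {("Gene1", "Gene3")}  # one edge added to the seed

# Hypothetical log-likelihoods of the expression data under each structure.
score_seed = log_posterior_score(-100.0, seed, seed)
score_weak = log_posterior_score(-99.5, perturbed, seed)    # small gain in fit
score_strong = log_posterior_score(-98.0, perturbed, seed)  # large gain in fit

print(score_seed, score_weak, score_strong)
```

With these numbers the extra edge is rejected when it improves the fit only slightly and accepted when the data support it strongly, which is the intended effect of a seed.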
Bayesian networks using microarray data and literature
Test Set: Golub et al. ALL/AML dataset
Learn BN with literature network as prior structure, Protein-Protein Interaction data (PPI), and literature+PPI
Perform 200 bootstrap network estimations and find links that are “high confidence”
Compare without prior (microarray data only) vs. with prior structure from the literature to look for known interactions.
Amira Djebbari
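The bootstrap step can be sketched independently of the network learner: resample the arrays with replacement, re-learn the network each time, and keep the links that recur. In the sketch below the learner is a deliberate stand-in (a correlation cutoff, not Bayesian network learning), the data are synthetic, and the 0.9 frequency threshold for "high confidence" is an assumption.

```python
import random
from collections import Counter
from itertools import combinations

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0

def learn_edges(samples, genes, cutoff=0.8):
    """Stand-in learner: link two genes when |correlation| exceeds the cutoff."""
    edges = set()
    for a, b in combinations(genes, 2):
        r = pearson([s[a] for s in samples], [s[b] for s in samples])
        if abs(r) > cutoff:
            edges.add((a, b))
    return edges

def bootstrap_confidence(samples, genes, n_boot=200, seed=0):
    """Fraction of bootstrap resamples in which each edge is recovered."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_boot):
        resample = [rng.choice(samples) for _ in samples]
        counts.update(learn_edges(resample, genes))
    return {edge: c / n_boot for edge, c in counts.items()}

# Synthetic data: Gene2 tracks Gene1, Gene3 is independent noise.
data_rng = random.Random(1)
genes = ["Gene1", "Gene2", "Gene3"]
samples = []
for _ in range(40):
    g1 = data_rng.gauss(0, 1)
    samples.append({"Gene1": g1, "Gene2": g1 + data_rng.gauss(0, 0.1), "Gene3": data_rng.gauss(0, 1)})

confidence = bootstrap_confidence(samples, genes)
high_confidence = {edge for edge, freq in confidence.items() if freq >= 0.9}
print(high_confidence)
```

Swapping `learn_edges` for a real structure learner, with or without a prior, gives the comparison described on the slide.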
BN: Literature + PPI
Cell Cycle Gene Subnetwork
Improving the Seeds
Co-occurrence does not provide directionality for interactions, but a BN is a DAG and our assignment is ad hoc
The literature contains information about how the genes (and their products) interact
The challenge is extracting that information from the literature: there is too much to read
Text mining doesn’t work well for the biomedical literature.
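The directionality problem can be seen in a few lines: co-occurrence yields unordered gene pairs, and a careless orientation can create a cycle, which a Bayesian network does not allow. The sketch below shows one ad hoc fix, orienting every pair along an arbitrary fixed gene order, which always yields a DAG. The gene names and the ordering are hypothetical, and the talk does not say which ad hoc rule was actually used.

```python
from graphlib import CycleError, TopologicalSorter  # Python 3.9+

def orient_by_order(undirected_pairs, gene_order):
    """Point every edge from the earlier gene to the later gene in gene_order."""
    rank = {gene: i for i, gene in enumerate(gene_order)}
    return [(a, b) if rank[a] < rank[b] else (b, a) for a, b in undirected_pairs]

def is_dag(directed_edges):
    """True if the directed edges contain no cycle."""
    predecessors = {}
    for src, dst in directed_edges:
        predecessors.setdefault(src, set())
        predecessors.setdefault(dst, set()).add(src)
    try:
        list(TopologicalSorter(predecessors).static_order())
        return True
    except CycleError:
        return False

# Hypothetical co-occurrence pairs; read as written, they form a cycle.
pairs = [("GeneA", "GeneB"), ("GeneB", "GeneC"), ("GeneC", "GeneA")]
oriented = orient_by_order(pairs, ["GeneA", "GeneB", "GeneC"])

print(is_dag(pairs), is_dag(oriented))
```

The result is acyclic but the directions carry no biological meaning, which is the motivation for extracting real subject and object roles from the literature.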
Improving the Seeds (2)
Solution: Use a hybrid approach!
Use text-mining tools to find sentences that contain names of two or more genes
Use the Amazon Mechanical Turk to extract [subject]-[predicate]-[object] triples
Define relationships between genes based on the “consensus” interaction
Combine these results with pathway databases to build seed networks.
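The "consensus" step can be sketched as a vote over the triples that different workers extract from the same sentence. The annotations, gene names, and the 60% agreement threshold below are all hypothetical; the talk does not specify how consensus was defined.

```python
from collections import Counter, defaultdict

# Hypothetical worker annotations: (sentence_id, subject, predicate, object).
annotations = [
    ("s1", "GeneA", "activates", "GeneB"),
    ("s1", "GeneA", "activates", "GeneB"),
    ("s1", "GeneB", "activates", "GeneA"),
    ("s2", "GeneC", "inhibits", "GeneD"),
    ("s2", "GeneC", "binds", "GeneD"),
]

def consensus_triples(rows, min_agreement=0.6):
    """Keep the most common triple per sentence if enough workers agree on it."""
    votes_by_sentence = defaultdict(Counter)
    for sentence_id, subj, pred, obj in rows:
        votes_by_sentence[sentence_id][(subj, pred, obj)] += 1
    consensus = {}
    for sentence_id, votes in votes_by_sentence.items():
        triple, count = votes.most_common(1)[0]
        if count / sum(votes.values()) >= min_agreement:
            consensus[sentence_id] = triple
    return consensus

result = consensus_triples(annotations)
print(result)
```

Each surviving triple gives a directed edge from subject to object, so the seed network no longer depends on an arbitrary orientation.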
Present data and information to the local community
PAGE DETAILS
- View aggregate statistics
- View cohort details
- Build cohort sets
- Build composite phenotypes
Actions:
- Go to data download for selected cohort
- Go to assay detail for selected cohort
- Go to cohort manager
PAGE DETAILS
Search:
- Facets
- Search within results
- Keyword prompts
- Search history
Table:
- Paged results
- Sortable columns
Actions:
- Go to Gene detail page
- Add genes to ‘gene set’
Gene Expression Summary
RNASeq
PAGE DETAILS
Annotation summary & summary view for each assay/data type:
Accordion style sections
- GEXP – expression profile across major Dx categories
- RNASeq – Exon structure of the gene
- SNPs – Table of SNPs in region of gene, highlighting association with major Dx group
- Methylation – Methylation profile in region around gene
- Genomic alterations – table of CNVs & alterations observed w/ freq in region around gene
Actions:
- Click through to assay detail page
- Add gene to set
Annotation Summary
Analysis Tools
PAGE DETAILS
- Very minimal parameters and options…here just 2 cohorts of interest, maybe p-value cutoff
Generates comprehensive report
Edit in place results – Don’t set parameters, edit the results
Analysis goes into queue, email notification when finished
Cohort 1:
Cohort 2:
Set 1
Set 2
Start Analysis
View analysis parameters
Job Status    Running
Job name: My job 1
Analysis of Differential Expression: My Job 1
Supervised Analysis
Meta analysis
Unsupervised analysis
PAGE DETAILS
-Very minimal parameters and options.
Generates comprehensive report
Edit in place results – Don’t set parameters, edit the results
Accordion style result sections
Generate PDF report of analysis
Analysis goes into queue, email notification when finished
We received a $1M Oracle Commitment grant to create our integrated clinical/research data warehouse
We’ve partnered with IDBS to create data portals
We are working with Illumina on a variety of projects
We are forging relationships with Thomson-Reuters to link genomic profiling data to drug, trial, and patent information
We are building partnerships with Roche, Genomatix, NEB, and others interested in entering the personal genomics space.
We need to find the best tools
The Mission
The mission of the CCCB is to provide broad-based support for the analysis and interpretation of ‘omic data and in doing so to further basic, clinical and translational research. CCCB also will conduct research that opens new ways of understanding cancer.
CCCB Service Offering
IT Infrastructure
- Application hosting
- Data management
- Custom software development
- Comprehensive collaboration portals
CCCB Service Offering
Next-Gen Sequencing
- Competitive per-lane pricing
- Integrated informatics
- Major focus for development in 2010
CCCB Service Offering
Analytical Consulting
- Bioinformatics / statistical data analysis
- Experimental design
- Value-add for IT/Sequencing services
CCCB Collaborative Consulting Model
1. Initial meeting to understand project scope and objectives
2. Development of an analysis plan and time/cost estimate
3. During project execution, data and results are exchanged through a secure, password-protected collaboration portal
4. Available as ad-hoc service, or larger scale support agreements
Acknowledgments
<[email protected]>
http://compbio.dfci.harvard.edu
The Gene Index Team: Corina Antonescu, Valentin Antonescu, Fenglong Liu, Geo Pertea, Razvan Sultana, John Quackenbush
Microarray Expression Team: Stefan Bentink, Thomas Chittenden, Aedin Culhane, Kristina Holton, Jane Pak, Renee Rubio
H. Lee Moffitt Center/USF: Timothy J. Yeatman, Greg Bloom
(Former) Stellar Students: Martin Aryee, Kaveh Maghsoudi, Jess Mar
Systems Support: Stas Alekseev, Sys Admin
Array Software Hit Team: Katie Franklin, Eleanor Howe, Sarita Nair, Jerry Papenhausen, John Quackenbush, Dan Schlauch, Raktim Sinha, Joseph White
Assistant: Patricia Papastamos
Center for Cancer Computational Biology: Mick Correll, Howie Goodell, Kristina Holton, Jerry Papenhausen, Patricia Papastamos, John Quackenbush
http://cccb.dfci.harvard.edu