Driving Discovery Through Data Integration and Analysis

John Quackenbush
Molecular Diagnostics World 2010

28 October 2010

Disease Progression and Personalized Care

[Figure labels: Natural History of Disease, Clinical Care, Environment + Lifestyle, Birth, Treatment, Death, Genetic Risk, Early Detection, Patient Stratification, Disease Staging, Treatment Options, Quality of Life, Outcomes, Biomarkers]

Assure access to samples and rational consent

Develop a technology platform

Make information integration a central mission

Conduct research as a vital component

Present data and information to the local community

Enable research beyond your own

Engage corporate partners

Communicate the mission to the community.

Turning the vision into a reality

Assure Access to Samples

Patients want to be part of the process of curing disease

Informed consent needs to be structured to allow patients to be partners in the research process

HIPAA requires both informed consent and that we assure patient confidentiality

But “identifiability” is a moving target in a genomic age

With the <$1000 genome, in the age of Facebook, what this means remains unclear

The new Genomics is a disruptive technology.

Access, Research, Security

Develop a Technology Platform

2006: State of the Art Sequencing

PRODUCTION
- Rooms of equipment
- Subcloning > picking > prepping
- 35 FTEs
- 3-4 weeks

SEQUENCING
- 74x Capillary Sequencers
- 10 FTEs
- 15-40 runs per day
- 1-2 Mb per instrument per day
- 120 Mb total capacity per day

Sequencing the genome took ~15 years and $3B

2008: Enabling a New Era in Genome Analysis

PRODUCTION
- 1x Cluster Station
- 1 FTE
- 1 day

SEQUENCING
- 1x Genome Analyzer
- Same FTE as above
- 1 run per 5 days
- 15 Gb per instrument per run
- >3 Gb per day (1x genome coverage)

We can now re-sequence the genome in ~1 week
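
The throughput figures quoted for 2006 and 2008 can be cross-checked with a little arithmetic. The sketch below uses only the numbers from the slides; the variable names are illustrative, and the comparison of a single 2008 instrument against the whole 2006 facility is one possible framing, not a claim made in the talk.

```python
# Figures quoted on the 2006 and 2008 sequencing slides.
capillary_instruments = 74
mb_per_instrument_per_day = (1, 2)   # quoted range, in Mb
quoted_2006_capacity_mb = 120        # quoted total, in Mb per day

gb_per_run_2008 = 15                 # one Genome Analyzer
days_per_run_2008 = 5

# 2006: total daily capacity implied by the per-instrument range.
low_2006 = capillary_instruments * mb_per_instrument_per_day[0]
high_2006 = capillary_instruments * mb_per_instrument_per_day[1]

# 2008: daily output of a single instrument, in Gb and in Mb.
gb_per_day_2008 = gb_per_run_2008 / days_per_run_2008
mb_per_day_2008 = gb_per_day_2008 * 1000

# Ratio of one 2008 instrument to the entire 2006 facility.
fold_increase = mb_per_day_2008 / quoted_2006_capacity_mb

print(f"2006 implied capacity: {low_2006}-{high_2006} Mb/day "
      f"(quoted: {quoted_2006_capacity_mb} Mb/day)")
print(f"2008 single instrument: {gb_per_day_2008:g} Gb/day")
print(f"Fold increase per day: {fold_increase:g}x")
```

The quoted 120 Mb/day sits inside the range implied by 74 instruments at 1-2 Mb each, and 15 Gb over a 5-day run averages to the roughly 3 Gb per day stated on the 2008 slide.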

The Challenge

New technologies inspired by the Human Genome Project are transforming biomedical research from a laboratory science to an information science

We need new approaches to making sense of the data we generate

The winners in the race to understand disease are going to be those best able to collect, manage, analyze, and interpret the data.

Make information integration a central mission

[Figure labels: Gene, RNA, Protein, Network]

http://compbio.dfci.harvard.edu

[Figure labels: Gene Index Databases; Resourcerer; Other Databases; TM4 Microarray Software; Other tools; MeSHer; ClusterMed; Bayesian Nets]

[Figure labels: Patient; DNA Microarray Analysis; Candidate Gene(s); Perturb Network (RNAi); Assay Response (A); Predict Network; Central Warehouse]

Other Things:
- Mesoscopic Expression
- Correlated Signatures
- State Space Gene Models
- Tiling Arrays to Genes

Dealing with an Information Overload

[Figure labels: Clinical Data, Metabolomics, Proteomics, Transcriptomics, Cytogenomics, Epigenomics, Genomics, Published Datasets, DrugBank, The HapMap, The Genome, Disease Databases (OMIM), PubMed, Clinical Trials, Chemical Biology, Etc.]

Beating Information Overload

Central Warehouse

- Improved Diagnostics
- Individualized Therapies
- More Effective Agents

[Figure: Dana-Farber Research DB Conceptual Architecture. Recoverable labels: Portals (Web Center Portal, Custom); Business Intelligence; Build or Buy; Oracle; Existing Enterprise Service Bus; Rules Engine; BPEL; genomics; HTB ODS; De-identification; Mapping; Terminology; Security; EMPI; Auditing; IDX; Rx; Lab; Clinical Trial; Dana Farber Clinical Systems; BAM Dashboard; OMICS; Dana Farber Lab; External Partners; Clinical Pathways; Web Service Directory; IdM & Security; Severity Score; RFID; External misc; PubMed; GenBank]

An Example: Signature Analysis

Aedin Culhane, Thomas Schwarzl, Joe White, Fenglong Liu, Kerm Picard

[Figure labels: PubMed; ArrayExpress; GEO; Random Websites; In-House Studies; Warehouse (Fenglong Liu, Kerm Picard); Analysis]

GeneChip Oncology Database

Fenglong Liu

Soon to be replaced by EBI’s Gene Expression Atlas

GeneSigDB – release 2

http://compbio.dfci.harvard.edu/genesigdb

GeneSigDB – comparing cancers

Cancer is a Cell-Cycle Disease

Aedin Culhane, Daniel Gusenleitner

Breast Cancer has unique signatures

Aedin Culhane, Daniel Gusenleitner

A sample research question

How many Multiple Myeloma patients, with bone marrow or blood samples in the bank, and who have a chromosome 13 deletion, responded (complete, partial, or minor remission) to therapy and how many did not respond?
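
To make the shape of such a question concrete, here is a minimal sketch of how it might be answered once clinical, sample-bank, and cytogenetic data sit in one place. Everything here is an assumption for illustration: the `Patient` record, its field names, and the toy records are invented, and a real warehouse would answer this with a query across several source systems.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Patient:
    """Hypothetical integrated record; field names are illustrative only."""
    diagnosis: str
    banked_samples: set   # e.g. {"bone marrow", "blood"}
    del13: bool           # chromosome 13 deletion detected
    response: str         # "complete", "partial", "minor", or "none"

RESPONSES = {"complete", "partial", "minor"}
ELIGIBLE_SAMPLES = {"bone marrow", "blood"}

def count_responders(patients):
    """Count responders and non-responders among eligible myeloma patients."""
    counts = Counter()
    for p in patients:
        eligible = (
            p.diagnosis == "Multiple Myeloma"
            and bool(p.banked_samples & ELIGIBLE_SAMPLES)
            and p.del13
        )
        if eligible:
            key = "responded" if p.response in RESPONSES else "did not respond"
            counts[key] += 1
    return counts

# Made-up records, for illustration only.
toy_patients = [
    Patient("Multiple Myeloma", {"bone marrow"}, True, "partial"),
    Patient("Multiple Myeloma", {"blood"}, True, "none"),
    Patient("Multiple Myeloma", set(), True, "complete"),    # no banked sample
    Patient("Multiple Myeloma", {"blood"}, False, "minor"),  # no chr 13 deletion
    Patient("Breast Cancer", {"blood"}, True, "complete"),   # other diagnosis
]

result = count_responders(toy_patients)
print(result["responded"], result["did not respond"])
```

The point of the sketch is that each filter draws on a different source (diagnosis, sample bank, cytogenetics, treatment outcome), which is why the question is hard to answer without integration.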

A Path Forward

We are working to develop a two-way strategy for the future

Clinic → Lab
Lab → Clinic

Consider OncotypeDx

This approach represents the intellectual framework for future success – and the bridges between the various laboratories and programs.

Conduct research as a vital component

Bayesian Networks
Amira Djebbari
Raktim Sinha
Dan Schlauch

When we say “Networks” we mean…

Genes are represented as “nodes”

Interactions are represented by “edges”

Edges can be directed to show “causal” interactions

Edges are not necessarily direct interactions

Bayesian network - example

Conditional probability table at node “Gene2”:

Gene1    Gene2=1|Gene1
-1       0.1
 0       0.2
 1       0.7

Edges represent dependencies

Learning Bayesian networks:
- Structure
- Conditional probability tables

[Figure: example network with nodes Gene1, Gene2, Gene3, Gene4]
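
As a small worked example of what a conditional probability table buys you, the sketch below takes the three values shown for node “Gene2” and computes the marginal probability that Gene2 = 1, then inverts the table with Bayes' rule. The uniform prior over Gene1's three states is an assumption made here for illustration; the slide does not give one.

```python
# Conditional probability table from the slide: P(Gene2 = 1 | Gene1 = state).
cpt_gene2_is_1 = {-1: 0.1, 0: 0.2, 1: 0.7}

# ASSUMPTION: uniform prior over Gene1's states (not given on the slide).
prior_gene1 = {-1: 1 / 3, 0: 1 / 3, 1: 1 / 3}

def marginal_gene2_is_1(cpt, prior):
    """P(Gene2 = 1), summing over the states of the parent Gene1."""
    return sum(cpt[state] * prior[state] for state in prior)

def posterior_gene1(cpt, prior):
    """P(Gene1 = state | Gene2 = 1) for each state, by Bayes' rule."""
    evidence = marginal_gene2_is_1(cpt, prior)
    return {state: cpt[state] * prior[state] / evidence for state in prior}

p_gene2 = marginal_gene2_is_1(cpt_gene2_is_1, prior_gene1)
posterior = posterior_gene1(cpt_gene2_is_1, prior_gene1)

print(f"P(Gene2=1) = {p_gene2:.3f}")
for state, prob in posterior.items():
    print(f"P(Gene1={state:+d} | Gene2=1) = {prob:.3f}")
```

Learning a Bayesian network means estimating both parts used here: which nodes are parents of which (the structure) and the table attached to each node.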

Bayesian networks - priors

No free lunch theorem (Wolpert & Macready, 1996):

The performance of a general-purpose optimization algorithm iterated on a cost function is independent of the algorithm when averaged over all cost functions.

This suggests that when considering a specific application one can introduce a potentially useful bias using domain knowledge.

A low-cost lunch?

- One can “help” the search along by providing a seed structure representing what we believe is the most likely network
- The network search process will then use gene expression data to look for perturbations on the structure that are supported by the data
- There are many possible sources of prior structures, including the biomedical literature and large-scale interaction studies (PPI)

Bayesian networks using microarray data and literature

Test Set: Golub et al. ALL/AML dataset

- Learn BN with literature network as prior structure, Protein-Protein Interaction data (PPI), and literature+PPI
- Perform 200 bootstrap network estimations and find links that are “high confidence”
- Compare without prior (microarray data only) vs. with prior structure from the literature to look for known interactions.
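
The bootstrap step can be sketched as follows: resample the expression profiles with replacement, re-estimate the network on each resample, and keep the links that recur in most estimates. This is a sketch under stated assumptions, not the method used in the study. In particular, `toy_learner` is a stand-in that links genes by correlation rather than performing Bayesian network structure learning, the 0.9 cutoff for “high confidence” is an assumed value, and the data are synthetic.

```python
import random
from collections import Counter
from itertools import combinations

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def toy_learner(samples, genes, threshold=0.8):
    """STAND-IN for BN structure learning: link genes with |r| >= threshold."""
    edges = set()
    for g1, g2 in combinations(genes, 2):
        r = pearson([s[g1] for s in samples], [s[g2] for s in samples])
        if abs(r) >= threshold:
            edges.add((g1, g2))
    return edges

def bootstrap_edge_confidence(samples, genes, learner, n_boot=200, seed=0):
    """Fraction of bootstrap estimates in which each edge appears."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_boot):
        resample = [rng.choice(samples) for _ in samples]
        counts.update(learner(resample, genes))
    return {edge: c / n_boot for edge, c in counts.items()}

# Synthetic profiles: B tracks A closely, C is independent noise.
data_rng = random.Random(1)
data = []
for _ in range(40):
    a = data_rng.gauss(0, 1)
    data.append({"A": a, "B": a + data_rng.gauss(0, 0.1), "C": data_rng.gauss(0, 1)})

confidence = bootstrap_edge_confidence(data, ["A", "B", "C"], toy_learner)
high_confidence = {edge for edge, freq in confidence.items() if freq >= 0.9}
print(high_confidence)
```

Swapping `toy_learner` for a real structure learner, optionally seeded with a prior network, leaves the bootstrap loop unchanged, which is what makes the with-prior and without-prior comparison straightforward.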

Amira Djebbari

BN: No Priors

Amira Djebbari

BN: PPI Data

Amira Djebbari

BN: Literature Priors

Amira Djebbari

BN: Literature + PPI

Cell Cycle Gene Subnetwork

Improving the Seeds

Co-occurrence does not provide directionality for interactions, but a BN is a DAG, so our assignment of direction is ad hoc

The literature contains information about how the genes (and their products) interact

The challenge is extracting that information from the literature: there is too much to read

Text mining doesn’t work well for the biomedical literature.

Improving the Seeds (2)

Solution: Use a hybrid approach!

Use text-mining tools to find sentences that contain names of two or more genes

Use the Amazon Mechanical Turk to extract [subject]-[predicate]-[object] triples

Define relationships between genes based on the “consensus” interaction

Combine these results with pathway databases to build seed networks.
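
The “consensus” step can be illustrated with a simple majority vote over the triples that different annotators extract from the same sentence. This is a minimal sketch, assuming one triple per annotator per sentence and a strict-majority rule; the function names, the agreement threshold, and the placeholder gene names are all hypothetical.

```python
from collections import Counter

def consensus_triples(annotations, min_agreement=0.5):
    """Keep, per sentence, the triple chosen by a strict majority of annotators.

    annotations maps a sentence id to a list of
    (subject, predicate, object) triples, one per annotator.
    """
    consensus = {}
    for sentence_id, triples in annotations.items():
        if not triples:
            continue
        triple, votes = Counter(triples).most_common(1)[0]
        if votes / len(triples) > min_agreement:
            consensus[sentence_id] = triple
    return consensus

def to_directed_edges(consensus):
    """Turn consensus triples into directed edges: (subject, object) -> predicate."""
    return {(subj, obj): pred for subj, pred, obj in consensus.values()}

# Placeholder annotations, for illustration only.
toy_annotations = {
    "sentence-1": [
        ("GENE_A", "activates", "GENE_B"),
        ("GENE_A", "activates", "GENE_B"),
        ("GENE_B", "activates", "GENE_A"),
    ],
    "sentence-2": [  # annotators split evenly, so no consensus is recorded
        ("GENE_B", "inhibits", "GENE_C"),
        ("GENE_C", "inhibits", "GENE_B"),
    ],
}

seed_edges = to_directed_edges(consensus_triples(toy_annotations))
print(seed_edges)
```

Because each triple has a subject and an object, the resulting edges carry a direction, which is exactly what plain co-occurrence cannot supply.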

“PredictiveNetworks” seeds from the literature

Present data and information to the local community

LGRC Research Portal

PAGE DETAILS

- View aggregate statistics
- View cohort details
- Build cohort sets
- Build composite phenotypes

Actions:

- Go to data download for selected cohort
- Go to assay detail for selected cohort
- Go to cohort manager

LGRC Research Portal

PAGE DETAILS

Search:
- Facets
- Search within results
- Keyword prompts
- Search history

Table:
- Paged results
- Sortable columns

Actions:
- Go to Gene detail page
- Add genes to ‘gene set’

Gene Expression Summary

RNASeq

PAGE DETAILS

Annotation summary & summary view for each assay/data type:

Accordion style sections

- GEXP – expression profile across major Dx categories
- RNASeq – Exon structure of the gene
- SNPs – Table of SNPs in region of gene, highlighting association with major Dx group
- Methylation – Methylation profile in region around gene
- Genomic alterations – table of CNVs & alterations observed w/ freq in region around gene

Actions:
- Click through to assay detail page
- Add gene to set

Annotation Summary

LGRC Research Portal

Analysis Tools

PAGE DETAILS

- Very minimal parameters and options…here just 2 cohorts of interest, maybe p-value cutoff
- Generates comprehensive report
- Edit in place results – Don’t set parameters, edit the results
- Analysis goes into queue, email notification when finished

[Form fields shown: Cohort 1; Cohort 2; Set 1; Set 2; Start Analysis; View analysis parameters; Job Status: Running; Job name: My job 1]

Analysis of Differential Expression: My Job 1

Supervised Analysis

Meta analysis

Unsupervised analysis

PAGE DETAILS

- Very minimal parameters and options.
- Generates comprehensive report
- Edit in place results – Don’t set parameters, edit the results
- Accordion style result sections
- Generate PDF report of analysis
- Analysis goes into queue, email notification when finished

Engage corporate partners

We received a $1M Oracle Commitment grant to create our integrated clinical/research data warehouse

We’ve partnered with IDBS to create data portals

We are working with Illumina on a variety of projects

We are forging relationships with Thomson-Reuters to link genomic profiling data to drug, trial, and patent information

We are building partnerships with Roche, Genomatix, NEB, and others interested in entering the personal genomics space.

We need to find the best tools

Enable research beyond your own

John Quackenbush, Director
Mick Correll, Associate Director

The Mission

The mission of the CCCB is to provide broad-based support for the analysis and interpretation of ‘omic data and in doing so to further basic, clinical and translational research. CCCB also will conduct research that opens new ways of understanding cancer.

CCCB Service Offering

IT Infrastructure
- Application hosting
- Data management
- Custom software development
- Comprehensive collaboration portals

Next-Gen Sequencing
- Competitive per-lane pricing
- Integrated informatics
- Major focus for development in 2010

Analytical Consulting
- Bioinformatics / statistical data analysis
- Experimental design
- Value-add for IT/Sequencing services

CCCB Collaborative Consulting Model

1. Initial meeting to understand project scope and objectives

2. Development of an analysis plan and time/cost estimate

3. During project execution, data and results are exchanged through a secure, password-protected collaboration portal

4. Available as ad-hoc service, or larger scale support agreements

Communicate the mission to the community.

The LGRC

Genomics is here to stay

Acknowledgments

<[email protected]>

http://compbio.dfci.harvard.edu

The Gene Index Team
- Corina Antonescu
- Valentin Antonescu
- Fenglong Liu
- Geo Pertea
- Razvan Sultana
- John Quackenbush

Microarray Expression Team
- Stefan Bentink
- Thomas Chittenden
- Aedin Culhane
- Kristina Holton
- Jane Pak
- Renee Rubio

H. Lee Moffitt Center/USF
- Timothy J. Yeatman
- Greg Bloom

(Former) Stellar Students
- Martin Aryee
- Kaveh Maghsoudi
- Jess Mar

Systems Support
- Stas Alekseev, Sys Admin

Array Software Hit Team
- Katie Franklin
- Eleanor Howe
- Sarita Nair
- Jerry Papenhausen
- John Quackenbush
- Dan Schlauch
- Raktim Sinha
- Joseph White

Assistant
- Patricia Papastamos

Center for Cancer Computational Biology
- Mick Correll
- Howie Goodell
- Kristina Holton
- Jerry Papenhausen
- Patricia Papastamos
- John Quackenbush

http://cccb.dfci.harvard.edu