izant openscience

Why we do it

Jonathan Izant, VP, Sage Bionetworks

Open Science Summit, 31 July 2010
www.sagebase.org

denial

Genomics does not yet teach us much

Pharma drug development is broken

Standards of care are inadequate

Academics limit open access

Genetics Timeline

1800 1900 2000

Gene Regulation circa 1990

DNA Variation

Complex Trait Variation

Molecular Trait Variation

“Standard” GWAS Approaches: identify causative DNA variation but provide NO mechanism

Profiling Approaches: genome-scale profiling provides correlates of disease. Many examples, BUT what is cause and effect?

“Integrated” Genetics Approaches: provide an unbiased view of molecular physiology as it relates to disease phenotypes, give insights on mechanism, provide causal relationships and allow predictions

RNA amplification

Microarray hybridization

[Figure axes: Gene Index, Tumors]

How is genomic data used to understand biology?

Merck & Co., Inc.
5 Year Program
Based at Rosetta
Total Resources >$150M

The “Rosetta Integrative Genomics Experiment”: Generation, assembly, and integration of data to build models that predict clinical outcome

• Generate data needed to build bionetworks
• Assemble other available data useful for building networks
• Integrate and build models
• Test predictions
• Develop treatments
• Design Predictive Markers

Constructing Bayesian Networks

Metabolic Disease
"Genetics of gene expression surveyed in maize, mouse and man." Nature. (2003)
"Variations in DNA elucidate molecular networks that cause disease." Nature. (2008)
"Genetics of gene expression and its effect on disease." Nature. (2008)
"Validation of candidate causal genes for obesity that affect..." Nat Genet. (2009)
Plus 10 additional papers in Genome Research, PLoS Genetics, PLoS Comp. Biology, etc.

CVD
"Identification of pathways for atherosclerosis." Circ Res. (2007)
"Mapping the genetic architecture of gene expression in human liver." PLoS Biol. (2008)
Plus 5 additional papers in Genome Res., Genomics, Mamm. Genome

Bone
"Integrating genotypic and expression data …for bone traits…" Nat Genet. (2005)
"..approach to identify candidate genes regulating BMD…" J Bone Miner Res. (2009)

Methods
"An integrative genomics approach to infer causal associations ..." Nat Genet. (2005)
"Increasing the power to detect causal associations…" PLoS Comput Biol. (2007)
"Integrating large-scale functional genomic data ..." Nat Genet. (2008)
Plus 3 additional papers in PLoS Genet., BMC Genet.

Extensive Publications now Substantiating Scientific Approach: Probabilistic Causal Bionetwork Models

• >60 publications from the Rosetta Genetics Group (~30 scientists) over 5 years, including high-profile papers in PLoS, Nature and Nature Genetics

Opportunity

The stunning technologies coming will generate heaps of genomic data

Bionetworks built with integrative genomic approaches can highlight the non-redundant components and can find drivers of disease and of therapies

Need to develop ways to host massive amounts of data and the evolving representations of disease captured by these probabilistic causal disease models

Drivers

Recognition that bionetwork-based molecular models of disease are powerful but require significant resources

Appreciation that it will require decades of evolving representations as real complexity emerges and needs to be integrated with therapeutic interventions

Realization that the donation by Merck might seed a “commons”, allowing a potential long-term gain to the whole community through evolving models of disease built via a contributor network


Mission

Sage Bionetworks is a non-profit organization with a vision to create a “Commons” where integrative bionetworks are evolved by contributor scientists with a shared vision to accelerate the elimination of human disease

Sage Bionetworks: a busy first year

Timeline milestones, 2009 to 2010:

14 staff move into Sage offices at FHCRC

First Board of Directors meeting

First NIH grant payment

Catalyst funding from Listwin, CHDI and Quintiles

NIH New Institution Review

Partnership with Pfizer

Partnership with Merck

$5m LSDF grant

1st Sage Commons Congress in SF

501(c)(3) determination

$8m NCI grant for new CCSB

Sage Bionetworks Partners

Research

Platform

Training

Global Coherent Data Sets

A data set containing genome-wide DNA variation and intermediate trait data, as well as physiological phenotype data, across a population of individuals large enough to power association or linkage studies, typically 50 or more individuals. To be coherent, the data need to be matched with consistent identifiers. Intermediate traits are typically gene expression, but may also include proteomic, metabolomic, and other molecular data.
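The coherence requirement in this definition can be illustrated with a minimal Python sketch. It assumes each data layer is a mapping keyed by individual identifier. The layer names, the function names and the use of 50 as a cutoff are illustrative assumptions taken from the wording "typically 50 or more", not a description of how Sage actually curates data sets.

```python
from typing import Mapping, Set

# Assumption: "typically 50 or more individuals" is treated here as a cutoff.
MIN_INDIVIDUALS = 50

# Hypothetical layer names for the three kinds of data the definition mentions.
REQUIRED_LAYERS = ("dna_variation", "intermediate_trait", "phenotype")


def shared_individuals(layers: Mapping[str, Mapping[str, object]]) -> Set[str]:
    """Return the individual IDs that appear in every required layer."""
    id_sets = [set(layers[name]) for name in REQUIRED_LAYERS]
    return set.intersection(*id_sets)


def is_coherent(layers: Mapping[str, Mapping[str, object]],
                min_individuals: int = MIN_INDIVIDUALS) -> bool:
    """True if all layers exist and enough individuals are matched across them."""
    if any(name not in layers for name in REQUIRED_LAYERS):
        return False
    return len(shared_individuals(layers)) >= min_individuals


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    ids = [f"IND{i:03d}" for i in range(60)]
    demo = {
        "dna_variation": {i: {} for i in ids},
        "intermediate_trait": {i: {} for i in ids[:55]},
        "phenotype": {i: {} for i in ids[5:]},
    }
    print(len(shared_individuals(demo)), is_coherent(demo))
```

Matching on a shared identifier is the point of the sketch: individuals missing from any one layer do not count toward the population size.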

The GCDs listed reflect the current state of knowledge and are subject to change as more information becomes available to Sage

http://www.sagebase.org/research/tools.html


Sage Commons Challenges

Standards (data, annotation)

Tools (combining, analyzing)

Citation (recognition)

Internationalization

Public Engagement

consistent data format and metadata

building the critical mass of contributors

Data standardization, data quality

enormous curation effort needed to correct for incompatible study designs, incomplete data gathering

IRB and protection of human subjects

Data interoperability

legal/licensing framework

Tools and standards: allow the resource to grow and evolve; capture metadata, quality measures and quality control in a standardized way

Visualization tools

platform independence

Designing a simple-to-use model for uploading and processing data

Ability to capture structured content

The Commons will need to resolve issues surrounding protection of human subjects data if the information is to be widely shared.

Barriers:

One year after it is generated, where is most clinical / genomic data stored? (87 respondents, multiple choices permitted)

The person/institution that was funded to generate the data

The Journals where it was published

The funding agency, regulated by agency rules

Government agency (e.g. NCBI, EBI)

Institutions who want to generate intellectual property

The patients who were studied

A non-profit public access organization

Hospitals and healthcare organizations

A commercial IT, biotechnology or pharmaceutical company

Other (please specify)

[Bar chart axis: 0% to 100%]

Problem: ‘Accessible’ data often isn’t

Question: What percentage of the clinical/genomic data that has been published is currently readily accessible for researchers to use?

Question: What percentage of published clinical/genomic data is currently available in a format that is easily downloaded in a way that facilitates new analysis?

Response options: more than 90%; between 50% and 90%; about 50%; between 50% and 10%; less than 10%

[Bar chart axis: 0% to 80%]

Collaborators

Biomedical research developed as a Cottage Industry

The need for multi-layer mega datasets and the vanishing ‘price’ for genes provide an incentive for a pre-competitive space for genomics

[Figure: Gene Licensing Deals ($US); year axis 1980 to 2020; value axis ticks from $100,000 to $1,000,000,000]

Incentives:

Researcher "turf" / lack of experience with sharing

Business case for contributing and sharing resources and information is unclear to many, while business case for hoarding them is well articulated and obvious.

Buy-in from tool developers, data producers and data users

Politics: competitive funding versus communal goals

Sociology and policy. Getting people to share and building trust.

Willingness by the community to share data and key ancillary information (e.g. pathology/clinical data for profiled samples)

Changing culture of individual recognition, publication, rewards, incentives

This is a social (political) experiment/enterprise as much as a scientific challenge. How to motivate individuals who are not community-inclined might be key.

The theory is great, the practice needs commitment from a wide variety of players

IMHO, the central challenge will be community adoption.

We need a team that will take the time to make sure we create a set of tools that can interoperate, rather than a set of tools that perform discrete independent tasks.

Andrea Califano, Columbia U.

Eric Schadt, PacBio - UCSF

Atul Butte, Stanford Med

Trey Ideker, UCSD

Stephen Friend, Sage Bionetworks

The Federation Experiment

Sage Bionetworks

Focused on improving treatment of disease

Working through extensive partnerships to enable research and drug development

Cultural challenges may eclipse technical and operational hurdles

www.sagebase.org
