The Thinking Behind Big Data at the NIH Philip E. Bourne Ph.D. Associate Director for Data Science National Institutes of Health http://www.slideshare.net/pebourne/

DESCRIPTION

Presented at the Big Data in Biomedicine Conference at Stanford University May 21, 2014

TRANSCRIPT

Page 1: The Thinking Behind Big Data at the NIH

The Thinking Behind Big Data at the NIH

Philip E. Bourne, Ph.D.

Associate Director for Data Science, National Institutes of Health

http://www.slideshare.net/pebourne/

Page 2: The Thinking Behind Big Data at the NIH

Disclaimer: I only started March 3, 2014

…but I and others had been thinking about this prior to my appointment

Page 3: The Thinking Behind Big Data at the NIH

Let me start with a few examples of what motivates our thinking …

Page 4: The Thinking Behind Big Data at the NIH

The Story of Meredith

http://fora.tv/2012/04/20/Congress_Unplugged_Phil_Bourne

Stephen Friend

Page 5: The Thinking Behind Big Data at the NIH

We Have Entered an Era of Deinstitutionalization & Democratization of Science

Daniel Hulshizer/Associated Press

Page 6: The Thinking Behind Big Data at the NIH

We Have Entered an Era of Deinstitutionalization & Democratization of Science: NIH Should Support This

Daniel Hulshizer/Associated Press

Page 7: The Thinking Behind Big Data at the NIH

I can’t reproduce research from my own laboratory?

Daniel Garijo et al. (2013) Quantifying Reproducibility in Computational Biology: The Case of the Tuberculosis Drugome. PLOS ONE 8(11): e80278.

Can you?

But what does it take and does it matter?

Page 8: The Thinking Behind Big Data at the NIH

47/53 “landmark” publications could not be replicated

[Begley, Ellis Nature, 483, 2012] [Carole Goble]

Page 9: The Thinking Behind Big Data at the NIH

Reproducibility Studies Are On-going Across the NIH

Expected outcomes:

– Improved accessibility to data and software

– Support for workflows

– Closer relationships with publishers

– Metrics for measuring reproducibility

– Closure of the research lifecycle loop

– Rewards for reproducibility

Page 10: The Thinking Behind Big Data at the NIH

You will notice that so far none of these issues has to do with “Big Data” per se

“Big Data” has simply brought more attention to these issues

Page 11: The Thinking Behind Big Data at the NIH

What Worries Me the Most - Sustainability

Source Michael Bell http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=830

Page 12: The Thinking Behind Big Data at the NIH

We Can’t Go On Like This – Some Options

Introduction of business models:

– The 50% model

– Mergers

– Acquisitions associated with best practices

– Centralization

– Public/private partnerships

– Fee for service

– Archiving

Usage metrics / impact ….

Page 13: The Thinking Behind Big Data at the NIH

We don’t know enough about how current data are used!

* http://www.cdc.gov/h1n1flu/estimates/April_March_13.htm

[Figure: Structure Summary page activity for H1N1 influenza-related structures, with time-axis ticks from Jan. 2008 to Jul. 2010. Structures shown: 1RUZ (1918 H1 Hemagglutinin) and 3B7E (Neuraminidase of A/Brevig Mission/1/1918 H1N1 strain in complex with zanamivir).]

[Andreas Prlic]

Page 14: The Thinking Behind Big Data at the NIH

Ironic Since Some Industries Thrive By Asking These Questions

Page 15: The Thinking Behind Big Data at the NIH

And This May Just be the Beginning

Evidence:

– Google car

– 3D printers

– Waze

– Robotics

From: The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies by Erik Brynjolfsson & Andrew McAfee

Page 16: The Thinking Behind Big Data at the NIH

Scholarship is broken

I have a paper with 16,000 citations that no one has ever read

I have papers in PLOS ONE that have more citations than ones in PNAS

I have data sets I am proud of but few places to put them

I edited a journal but it did not count for much

Page 17: The Thinking Behind Big Data at the NIH

The reward system is in need of repair

Page 18: The Thinking Behind Big Data at the NIH

Okay… enough of the problems

What are some solutions?

Page 19: The Thinking Behind Big Data at the NIH

Approach to Solutions

New policies, e.g. data sharing, blanket consent

Funding where it is most needed:

– New metrics

– De-identification

– Agile pilots

– Smaller funding for the many, but with appropriate governance

– Competitions

– Coordination across agencies and countries

Shared infrastructure

Support for new reward systems

Page 20: The Thinking Behind Big Data at the NIH

How We Are Starting to Organize Ourselves

Page 21: The Thinking Behind Big Data at the NIH

Associate Director for Data Science

Scientific Data Council; External Advisory Board

The Biomedical Research Digital Enterprise

Diagram row labels: Programmatic Theme; Deliverable; Example Features

• Commons / Sustainability*: Cloud – Data & Compute; Search; Security; Reproducibility Standards; App Store

• Training Center / Education*: Coordinate; Hands-on; Syllabus; MOOCs

• BD2K / Innovation*: Community; Centers; Training Grants; Catalogs; Standards; Analysis

• Modified Review / Process: Data Resource Support; Metrics; Best Practices; Evaluation; Portfolio Analysis

• Communication / Collaboration: ICs; Researchers; Federal Agencies; International Partners; Computer Scientists

* Hires made

Page 22: The Thinking Behind Big Data at the NIH

Solution: The Power of the Commons

Data

The Long Tail

Core Facilities/HS Centers

Clinical/Patient

The Why: Data Sharing Plans

The Commons

Government

The How:

Data Discovery Index

Sustainable Storage

Quality

Scientific Discovery

Usability

Security/Privacy

Commons == Extramural NCBI == Research Object Sandbox == Collaborative Environment

The End Game:

Knowledge

NIH Awardees

Private Sector

Metrics/Standards

Rest of Academia

Software Standards Index

BD2K Centers

Cloud, Research Objects, Business Models

Page 23: The Thinking Behind Big Data at the NIH

What Does the Commons Enable?

Dropbox-like storage

The opportunity to apply quality metrics

Bring compute to the data

A place to collaborate

A place to discover

http://100plus.com/wp-content/uploads/Data-Commons-3-1024x825.png

Page 24: The Thinking Behind Big Data at the NIH

Commons Timeline

Spring/Summer 2014: The DS group is gathering information about activities and needs from ICs (and outside communities).

– Shared interests in developing a cloud-based biomedical commons.

– Investigating potential models of sustainability.

– Exploring metrics of usefulness and success.

Fall 2014: Develop possible pilots to explore options in addition to those already being implemented by some ICs.

Page 26: The Thinking Behind Big Data at the NIH

Training

Summary of Training Workshop and Request for Information:

– http://bd2k.nih.gov/faqs_trainingFOA.html

– Contact: Michelle Dunn (NCI)

Training Goals:

– develop a sufficient cadre of researchers skilled in the science of Big Data

– elevate general competencies in data usage and analysis across the biomedical research workforce.

Page 27: The Thinking Behind Big Data at the NIH

BD2K Training RFAs

K01s for Mentored Career Development Awards, RFA-HG-14-007

Provides salary and research support for 3-5 years for intensive research career development under the guidance of an experienced mentor in biomedical Big Data Science.

R25s for Courses for Skills Development, RFA-HG-14-008

Development of creative educational activities with a primary focus on Courses for Skills Development.

R25 for Open Educational Resources, RFA-HG-14-009

Development of open educational resources (OER) for use by large numbers of learners at all career levels, with a primary focus on Curriculum or Methods Development.

Page 28: The Thinking Behind Big Data at the NIH

Contemplating CSHL style training center(s)

Page 30: The Thinking Behind Big Data at the NIH

BD2K Innovation

Data Discovery Index Coordination Consortium (U24) (closed)

Metadata standards (under development)

Targeted Software Development

Development of Software and Analysis Methods for Biomedical Big Data in Targeted Areas of High Need (U01)

– RFA-HG-14-020

– Application receipt date: June 20, 2014

– Topics: data compression/reduction, visualization, provenance, or wrangling.

– Contact: Jennifer Couch (NCI) and Dave Miller (NCI)

Page 31: The Thinking Behind Big Data at the NIH

BD2K Innovation

BISTI PARs

– BISTI: Biomedical Information Science and Technology Initiative

– Joint BISTI-BD2K effort

– R01s and SBIRs

– Contacts: Peter Lyster (NIGMS) and Jennifer Couch (NCI)

Workshops:

– Software Index (Last week)

• Need to be able to find and cite software, as well as data, to support reproducible science.

– Cloud Computing (Summer/Fall 2014)

• Biomedical big data are becoming too large to be analyzed on traditional localized computing systems.

– Contact: Vivien Bonazzi (NHGRI)

Page 32: The Thinking Behind Big Data at the NIH

BD2K Innovation Centers

FY14 Investigator-initiated Centers of Excellence for Big Data Computing in the Biomedical Sciences (U54) RFA-HG-13-009 (closed)

BD2K-LINCS-Perturbation Data Coordination and Integration Center (DCIC) (U54) RFA-HG-14-001 (closed)

Page 34: The Thinking Behind Big Data at the NIH

Some Thoughts About Process

Machine readable data sharing plans?

Open review?

Micro funding?

Standing data committees to explore best practices?

Crowd sourcing?

Page 36: The Thinking Behind Big Data at the NIH

Where Do We Want to End Up?

Page 38: The Thinking Behind Big Data at the NIH

Components of The Academic Digital Enterprise

Consists of digital assets

– E.g. datasets, papers, software, lab notes

Each asset is uniquely identified and has provenance, including access control

– E.g. publishing simply involves changing the access control

Digital assets are interoperable across the enterprise
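
The asset model on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `DigitalAsset`, the `Access` levels, and the field names are hypothetical and do not come from the talk. It shows an asset that carries a unique identifier, a provenance log, and an access-control setting, so that "publishing" is nothing more than a change of access level.

```python
from dataclasses import dataclass, field
from enum import Enum
import uuid


class Access(Enum):
    """Hypothetical access levels; the slide does not enumerate any."""
    PRIVATE = "private"
    GROUP = "group"
    PUBLIC = "public"


@dataclass
class DigitalAsset:
    """A dataset, paper, piece of software, or set of lab notes."""
    kind: str
    title: str
    owner: str
    access: Access = Access.PRIVATE
    # Each asset gets its own unique identifier when it is created.
    asset_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    # Provenance is kept as an append-only list of (actor, action) pairs.
    provenance: list = field(default_factory=list)

    def record(self, actor: str, action: str) -> None:
        self.provenance.append((actor, action))

    def publish(self, actor: str) -> None:
        # Publishing only changes the access control and logs that change.
        self.access = Access.PUBLIC
        self.record(actor, "access changed to public")


notebook = DigitalAsset(kind="lab notes", title="Rotation notebook", owner="jane")
notebook.record("jane", "created")
notebook.publish("jane")
```

Keeping provenance inside the asset itself, instead of in a separate system, is one way to let the record travel with the asset as it moves across the enterprise.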

Page 39: The Thinking Behind Big Data at the NIH

Life in the Academic Digital Enterprise

Jane scores extremely well in parts of her graduate on-line neurology class. Neurology professors, whose research profiles are on-line and well described, are automatically notified of Jane’s potential based on a computer analysis of her scores against the background interests of the neuroscience professors. Consequently, professor Smith interviews Jane and offers her a research rotation.

During the rotation she enters details of her experiments related to understanding a widespread neurodegenerative disease in an on-line laboratory notebook kept in a shared on-line research space – an institutional resource where stakeholders provide metadata, including access rights and provenance beyond that available in a commercial offering. According to Jane’s preferences, the underlying computer system may automatically bring to Jane’s attention Jack, a graduate student in the chemistry department whose notebook reveals he is working on using bacteria for purposes of toxic waste cleanup. Why the connection? They reference the same gene a number of times in their notes, which is of interest to two very different disciplines – neurology and environmental sciences. In the analog academic health center they would never have discovered each other, but thanks to the Digital Enterprise, pooled knowledge can lead to a distinct advantage.

The collaboration results in the discovery of a homologous human gene product as a putative target in treating the neurodegenerative disorder. A new chemical entity is developed and patented. Accordingly, by automatically matching details of the innovation with biotech companies worldwide that might have potential interest, a licensee is found. The licensee hires Jack to continue working on the project. Jane joins Joe’s laboratory, and he hires another student using the revenue from the license. The research continues and leads to a federal grant award. The students are employed, further research is supported and in time societal benefit arises from the technology.

From: What Big Data Means to Me. JAMIA 2014 21:194
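
The matching step in the scenario, where Jane and Jack are connected because both notebooks reference the same gene several times, could be sketched as below. Everything here is an assumption made for illustration: the function names, the `min_mentions` threshold, and the placeholder gene symbols (`GENE1` and so on) are hypothetical, and a real system would need proper entity recognition rather than simple token matching.

```python
from collections import Counter
from itertools import combinations


def gene_counts(notes: str, vocabulary: set) -> Counter:
    """Count mentions of known gene symbols in free-text notes."""
    words = (w.strip(".,;:()").upper() for w in notes.split())
    return Counter(w for w in words if w in vocabulary)


def shared_genes(counts_a: Counter, counts_b: Counter, min_mentions: int = 2) -> list:
    """Genes that both notebooks mention at least min_mentions times."""
    common = counts_a.keys() & counts_b.keys()
    return sorted(
        g for g in common
        if counts_a[g] >= min_mentions and counts_b[g] >= min_mentions
    )


def suggest_connections(notebooks: dict, vocabulary: set, min_mentions: int = 2) -> list:
    """Return (owner_a, owner_b, genes) for each pair with shared genes."""
    counts = {owner: gene_counts(text, vocabulary) for owner, text in notebooks.items()}
    suggestions = []
    for a, b in combinations(sorted(counts), 2):
        genes = shared_genes(counts[a], counts[b], min_mentions)
        if genes:
            suggestions.append((a, b, genes))
    return suggestions


# Placeholder symbols, not real gene names.
VOCAB = {"GENE1", "GENE2", "GENE3"}
NOTEBOOKS = {
    "jane": "Knockdown of GENE1 altered aggregation. GENE1 expression rose; GENE2 unchanged.",
    "jack": "Strain degrades toluene when GENE1 is induced. Deleting GENE1 stops cleanup. GENE3 irrelevant.",
}
suggestions = suggest_connections(NOTEBOOKS, VOCAB)
```

Requiring repeated mentions on both sides is a crude guard against suggesting a connection over a gene that one notebook names only in passing.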

Page 40: The Thinking Behind Big Data at the NIH

Some Acknowledgements

Eric Green & Mark Guyer (NHGRI)

Jennie Larkin (NHLBI)

Vivien Bonazzi (NHGRI)

Michelle Dunn (NCI)

Mike Huerta (NLM)

David Lipman (NLM)

Jim Ostell (NLM)

Peter Lyster (NIGMS)

All of the more than 100 folks on the BD2K team

Page 41: The Thinking Behind Big Data at the NIH

NIH… Turning Discovery Into Health

[email protected]